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Abstract. This paper describes a novel theoretical characterization of the performance of non-local means 
(NLM) for noise removal. NLM has proven effective in a variety of empirical studies, but little is 
understood fundamentally about how it performs relative to classical methods based on wavelets 
or how its parameters should be chosen. For cartoon images and images which may contain thin 
features and regular textures, the error decay rates of NLM are derived and compared with those 
of linear filtering, oracle estimators, Yaroslavsky's filter and wavelet thresholding estimators. The 
trade-off between global and local search for matching patches is examined, and the bias reduction 
associated with the local polynomial regression version of NLM is analyzed. The theoretical results 
are validated via simulations for 2D images corrupted by additive white Gaussian noise. 

Key words. Non-local means (NLM), Yaroslavsky's filter, kernel smoothing, patch-based methods, local poly- 
nomial regression, oracle bounds, minimax bounds, cartoon model, textures. 

1. Introduction. The classical problem of image noise removal has drawn significant at- 
tention during the past few decades from the image processing, computational harmonic anal- 
ysis, nonlinear approximation, and statistics communities. In recent years there has been 
a resurgence of interest in kernel-based methods, including the ubiquitous non-local means 
(NLM) algorithm [6], due to their practical efficacy on broad collections of "natural" images. 
While there is a wealth of theoretical analysis associated with nonlinear thresholding estima- 
tors based on wavelets and related sparse multiscale representations of images [13, 14, 35, 51] 
or on diffusion models [52, 47] and partial differential equations [39, 1], performance guarantees 
for NLM are lacking and this paper aims at providing some results in this direction. 

In this paper, we explore the theoretical underpinnings of adaptive kernel-based image 
estimation and derive bounds on the mean squared error as a function of the number of pixels 
observed and features of the underlying image. The denoising methods we consider are based 
on estimating each pixel value with a weighted sum of the surrounding pixels. Depending on 
how the weights in this average are selected, this corresponds to classical linear filters [38, 60], 
Yaroslavsky's filter (YF) [62], the Sigma filter [28], or the bilateral filter [55]. It also includes 
variable-bandwidth kernel estimators [29], referred to as Lepski's method by statisticians and 
as the Intersection of Confidence Intervals (ICI) rule [22, 23] in signal processing. Other 
variants for a local choice of the kernel include [49, 53] We refer to [20, 43, 36] for more 
insights on a unifying framework for averaging filters. 

As none of these methods have been explicitly designed to deal with textured regions, 
many authors, inspired by work on texture synthesis [16] and inpainting [8], have proposed 
to introduce patches (small sub-images) to take advantage of natural image redundancy, es- 
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pecially in textured regions. NLM [6] and UINTA [3] algorithms are typical examples of this 
approach, as is their extension using Lepski's method [26]. Those algorithms rely on averag- 
ing similar pixels, where the similarity is measured through patches centered on the pixel of 
interest. Some more elaborate methods have tried to remove artifacts appearing in regions 
with low redundancy [45] - a phenomenon also known as the rare patch effect [15] - for 
instance by choosing NLM parameters automatically and locally. A common tool used for 
this local adaptivity is the Stein Unbiased Risk Estimate (SURE) [15, 58, 59]. 

Most current state-of-the-art methods for denoising take advantage of the patch framework 
[32, 9, 10]. The interested reader could get a clear picture of practical performance of those 
recent methods, in the review paper by Katkovnik et al. [25]. Despite the strong empirical 
performance of these methods, few performance guarantees exist: bounds with information 
theory flavor are derived in [61] for a simple version of NLM; a consistency result relying on 
beta-mixing assumptions on the image and on the noise (both modeled as random variables) 
is obtained in [5, 6]; [47] proposes a graph-diffusion interpretation for a simple image model; a 
bias/ variance analysis aiming at locally choosing NLM parameters is carried out in [15]; [30, 7] 
obtain Cramer-Rao type efficiency results. While finishing this paper, we became aware of 
two related papers by Maleki, Narayan and Baraniuk, addressing optimal performance in 
the context of non-parametric minimax estimation [34, 33]. [34] evaluates the performance 
of NLM for the piecewise constant horizon model [27], while [33] considers an anisotropic 
variant of NLM for the same image class. The latter shares several features with earlier work 
on anisotropic NLM [11, 12]. Our work is most closely related to [34], addressing the same 
challenge of quantifying the performance of NLM and related methods, and at the same time 
contains several novel contributions. While the paper was under review, we learned about an 
older paper of Tsybakov [56]. This paper proposes and analyzes a patch-based method that 
compares the medians over patches. The paper also derives a minimax lower bound for the 
cartoon model we consider. We comment in more detail on the work of Maleki [34] et al and 
the work of Tsybakov [56] in Section 7. 

1.1. Our contribution. We derive theoretical performance bounds for the linear filter, or- 
acle variable-bandwidth kernel methods, Yaroslavsky's filter and NLM — both the original [6] 
and a fast patch-mean based variant [31] — in the classical "cartoon" model in which an image 
consists of smooth surfaces separated by a smooth discontinuity, a popular model in statistics 
[27]. Our results are for the local polynomial versions of these methods. (The systematic 
bias associated with NLM near discontinuities — and boundaries — is shown to disappear 
when using a local polynomial regression.) We also consider nonstandard image classes, one 
modeling images with thin features and another one modeling regular textures. The latter 
is particularly significant because it highlights some of the key advantages of patch-based 
methods over, say, wavelet thresholding estimators. Previous insights into the performance 
of NLM-like methods on textures are empirical at best; we are not aware of any theory in 
this vein. Our benchmarks are two oracle inequalities, though many of our theoretical results 
can be compared directly with similar classical results in the wavelet literature and known 
minimax lower bounds on mean squared error (MSE) [14, 27]. 

The cartoon model for images has been a benchmark for image denoising methods, at 
least since the work of Korostelev and Tsybakov, condensed in [27]. This model is relevant 
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when comparing denoising methods on texture-less images. The other models are novel and 
tailored to situations where the image exhibits some thin features — like the legs of the 
Cameraman's tripod — and regular textures — like the patterns in Barbara's blouse. Though 
these models do not reflect the complexities of real images, we do gain some qualitative 
insights. First, we learn that variable bandwidth kernel methods are fundamentally limited 
by the bias near discontinuities. Yaroslavsky's filter is found to be near-optimal when the noise 
level is sufficiently low that the different regions in the cartoon image do not mix when noise 
is added; and when this is not the case, the method becomes useless. In non-local means, the 
patch size should be chosen just sufficiently large that nearby patches from different regions 
look different (in the average version of the NLM, this can be made very precise). The 
search window should be chosen like a standard kernel bandwidth. We quickly argue that not 
localizing these methods may lead to very poor performance, in agreement with [44, 64]. Also, 
while the NLM average and regular NLM perform similarly on cartoon images, the latter is 
superior when textures are present. 

1.2. Organization of the paper. In Section 2 we describe the mathematical framework. 
In Section 3 we introduce the methods that we analyze in the sequel. In Section 4 we state 
performance guarantees in the cartoon model for these methods, and in Section 5 we do the 
same in the context of the thin feature and regular pattern models. In Section 6 we perform 
some numerical experiments carefully illustrating our theoretical findings. In Section 7 we 
contrast our contribution to that of Maleki et al [34] and discuss extensions. The proofs are 
gathered in Section 8, which includes general results on local polynomial regression which may 
be of independent interest. 

1.3. Notation. We use standard notation. For non-negative sequences {(in) and (6^), 
a n = 0(b n ) (same as a n ^ b n ) if the sequence \a n /b n \ is bounded from above; a n x b n if 
a n — 0(b n ) and b n — 0{a n )\ a n — o(b n ) if a n /b n as n oo. For real numbers a and 6, 
a V b = max(a, b) while a A b = min(a, b). For a Lebesgue- measurable subset A C M, d 1 Vol (A) 
denotes its Lebesgue measure. For any x G we define its Euclidean and sup norm as 

II d I I 

^ oo — max \ Xq j . 

i=i 



We use the notation £>(0, 1) (resp. £>(0, 1)) to denote the open (resp. closed) unit ball for the 
supnorm. For 77 > 0, we define the //-neighborhood (for the norm || • ||) of a set A C M, d as 

B{A,r)) = {x e R d : dist(s, A) < 77}. 

For a discrete set A, we denote its cardinality by either \A\ or #A For a set A C is 
the indicator function of A, while for a discrete subset B C {1, . . . , m}, 1b denotes the vector 
with entries indexed by B equal to one, and all others equal to zero. Additional notation is 
introduced in the text as needed. 

2. Function estimation in additive white noise. We cast the problem of image denoising 
as a non-parametric regression problem in the presence of white noise, a standard model in 
statistics [27]. We consider the general d-dimensional problem, and use the term "image" to 
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denote any discretized signal on the d-dimensional square lattice, with important cases when 
1 < d < 4. Though patch-based methods were designed for 2D images, we consider a general 
dimension, as the same techniques may apply in color, spectral, 3D and 4D imaging [63]. 

We observe noisy samples {yi G M : i G 1%} (where I n := {1, . . . , n}) of the target function 
/ : [0, l] d [0, 1] at the design points {xi G M, d : i G 1%} corrupted by an additive noise 
{ei G R : i G I*}, as follows 

yi = f(xi)+s u ieli (2.1) 

For now, we only assume that the noise {si : i G are uncorrelated with mean zero and 
variance a 2 , though some results will require some tail bounds. Also, for concreteness, we focus 
on a standard model in image processing where the sample points are on the square lattice, 
specifically, X{ — ((ii — l/2)/n, . . . , — 1/2) jn) when i = (ii, . . . ,z^). Leaving n implicit, 
define vectors y = (yi : i G 1^), f = (fi : i G 1^) with := /(x^) and e = : i G 1^). The 
vector model can thus be written 

y = f + e. (2.2) 

We focus on estimating a function / on the grid, namely our goal is to estimate the vector 
f and we measure the performance of an estimator f in terms of (MSE): 

MSEHf) = ^%M = J_ EE(/ V/,) 2 , 

iei* 

where the expectation E is with respect to the probability measure associated with the noise. 

Although our analysis may be generalized to other norms, mean squared error is handy 
because of the point-wise (squared) bias and variance decomposition: 

nh-h?=m-h?+Hnfi)-h)\ Vie# (2.3) 

V v ' v ' 

Squared Bias Variance 
This leads for the vector estimate to the following decomposition: 

E||f - f||l = ||E(f)- f||l +E||E(f)- f||l. 

To recover the function / only through a finite number of measurements, it is customary 
to require that the targeted function belongs to a class T of structured functions such as 
smooth, piecewise smooth, or periodic textured images. In this context, the minimax risk 
over the function class T is defined as 

lV n {T) =inf supMSE/(f), 
f feT 

where the inflmum is over all the measurable function with respect to the observations. We say 
that an estimator is (rate-)optimal for the class T if its worst-case MSE over T is comparable 
to the minimax risk, i.e., (assuming implicitly that n becomes large) 

Tl n ({,F) := supMSE/C?) = 0{K* n {F)). 
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2.1. Cartoon images. We are particularly interested in situations where the function / 
has discontinuities: this is typical of images, mainly because of occlusions occurring in natural 
scenes. We say that / is a "cartoon image" if it is a piecewise smooth image with discontinuities 
along smooth hyper surf aces. This model spurred the greatest part of the research in image 
processing and is very common when no texture is present [27]. For simplicity, we consider 
that / is made of two pieces with each piece being Holder smooth. Note that all our results 
apply to the more general case where / is made of more than two pieces. For a function 
g : R d — > R and s = (si, . . . , Sd) G N d , we denote the s-derivative of g at x G R d by 

( ) <9 |s| 
9 W (x) = oil 

where |s| := si + • • • + s^. 

Definition 2.1 (Holder function class). For a, Co > 0, we define Hdi®, Co) as the Holder 
class of functions g : [0, l] d [0, 1] that are [a\ times differentiable ([a\ is the largest integer 
strictly less than a ) and satisfy 

Vx G [0, Vs G N d , 1 < \s\ < [a\ : |^ (5) (^)l < C ; (2.4) 
V(x,xO G [0,l] d , VsGN d ,|s| = [a\ : \g M (x) - g(*\x')\ < C \\x - x'\\^^ . (2.5) 

The main feature of Holder functions of order a is that they are well-approximated locally 
by a polynomial (in fact, their Taylor expansion) of degree [a\, cf. Lemma 8.1. 

Definition 2.2 (Cartoon function class). For a,C > 0, let J* cartoon (a, C ) denote the set of 
functions of the form 

f(x) = t{ xe n} fn(x) + t{ xe nc } fac(x), (2.6) 
where fac g Hd( a ^Co), with jump (or discontinuity gap) 

H(f) := inf \f a (x) - fa(x)\ > 1/Co, (2.7) 

xedQ 

and Q C (0, l) d is a bi-Lipschitz image of the (Euclidean) unit ball 5(0, 1) ; specifically, Q = 
0(,B(O,1)) ; where (j) : R. d — > R. d is infective with (f) and both Lipschitz with constant Co 
(I.e., Co-Lipschitz) with respect to the supnorm. We refer to fa as the foreground and to fac 
as the background. Moreover dQ represents the (topological) boundary of ft. 

The condition (2.7) is a lower bound on the minimum "jump"t along the discontinuity 
dQ. We require that (ft is bi-Lipschitz to ensure that the set Q is sufficiently smooth and does 
not have a serious bottleneck, which could potentially mislead the methods discussed here. 

We define the jump-to-noise ratio (JNR) for a target function / with jump and noise 

standard deviation a, as being the quantity 

JNR = f^l . (2.8) 

G 

We assume throughout that \i x 1, so that our bounds (which scale with a) reflect performance 
also as a function of JNR. In the cartoon model, we focus on the case where the noiseless 
image is at least piecewise Lipschitz, that is, a > 1. Note that our results apply to the case 
where a > 1/2, and that simple linear filtering is essentially optimal when a < 1/2. The 
setting is illustrated in Figure 2.1(a). 
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(a) Blob (b) Bowl (c) Swoosh (d) Stripes (e) Ridges (f) Barbara (g) Cameraman 



Figure 2.1. Original and noisy images: cartoon (Blob, Bowl), thin features (Swoosh), texture (Stripes) and 
natural images (Ridges, Barbara, Cameraman). 

2.2. Thin features and textures. In addition to considering cartoon images as denned 
above, we will consider images which contain other features common in natural images, such 
as thin regions a few pixel wide and regular textures. We consider simple models for these 
and show that YF and, more generally, the NLM perform much better than linear filtering. 
These models are instances of the cartoon model where the forefront Q varies with n. Let 
J 7 (a, Co) be defined as J rcartoon (a, Co) but without constraints on fi. 

As a simple model of thin feature, consider an image / in the cartoon family, but where ft 
is a thin do-dimensional surface of thickness a — which will vary with n. A classical example 
of this kind of structure is the support bar of the Cameraman's tripod, see Figure 2.1(g). An 
example of function from this class is illustrated by the Swoosh image, see Figure 2.1(c). 

Definition 2.3 (Thin feature function class). 

^ thin (a, C , d , a) := {/ E F(a, C ) : ft = {x = (x 7 , z) : dist(z, <j>{x')) < a}} , 

where (j) : (0, l) d ° (0, l) d ~ do is C -Lipschitz. 

We may similarly define a class of regular pattern functions which themselves may not be 
smooth, but which occur repeatedly across the image domain. This structure would be difficult 
to exploit with, say, wavelet-based methods that fail to take advantage of image redundancy. 
However, empirical evidence suggests that non-local adaptive kernels can perform quite well 
on these images. A classical example of this type of image structure is the striped scarf in the 
Barbara image. The following is a class where ft is made of the disjoint union of translates of 
a smaller region Qq of diameter of order a — which will vary with n. 

Definition 2.4 (Regular pattern function class). 

^pattern^ ^ a ) := J / g ffa Cq ) : Q = (0, l) d H J (H + v) I , 
[ veaZ d ) 

where S C (0, a) d is any set. Note that the union above is disjoint. 

An example of function from this class is illustrated by the Stripes image in Figure 2.1(d). 
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3. Background on kernel methods for denoising. We now describe NLM and other 
related methods. The story starts with kernel smoothing {i.e., linear filtering). Though this 
age-old method (with a proper choice of kernel) is essentially optimal when the image does 
not have discontinuities, its performance suffers dramatically in the presence of edges, which 
it tends to blur. YF, and more generally NLM, attempt to choose the kernel adapt ively so as 
to avoid averaging over the discontinuity. 

The estimates we consider are weighted averages of the pixel values of the form 

The various methods that we study in this paper differ only in the choice of weights ouij. 
Adaptation to higher order of smoothness is often accomplished by a local polynomial re- 
gression (LPR) [17, 20]. The local polynomial estimator of degree r and weights (cJij) is 



h = af 



a v J — arg mm 

a 




(3.2) 



where x s :— x^ 1 • • • x^ d , for x — {x\, . . . , xj) E and s — (si, . . . , s^) E and the minimiza- 
tion in (3.2) is over a = {a s : < \s\ < r) E K g where q = ( r ^ d )- Note that, in fact, (3.2) leads 
to an estimator of the form (3.1) with different weights {e.g., a smoother kernel) [57, p. 34]. 
We assume throughout that the polynomial degree r is sufficiently large to take full advantage 
of the smoothness of /. Specifically, if / E J rcaxtoon {a^ Co), we assume that r > [a\. When 
the number of nonzero weights in (3.2) is not enough to determine fi uniquely, we define fi as 
Hi, namely, we do not apply any smoothing. Alternatively, one could decrease the degree of 
the polynomial regression until the fit is well-defined, but this is not important in our setting. 

Since we know that / takes values in [0, 1], we clip / so that it also takes values in [0, 1]. 
This clipping does not increase the MSE. 

3.1. Linear filtering (LF). This method can be traced back in the statistics literature to 
the work of Nadaraya [38] and Watson [60] {cf. [20] for details on kernel methods). In this 
context the choice of the similarity between two pixels is only controlled by spatial proximity: 

Uij = K h {xi,Xj) , (3.3) 

where Kh{x,x f ) = K{f^ , ^) for a kernel function K and a bandwidth h > 0, which is inde- 
pendent of the location in the nonadaptive (classical) version. Common choices include the 
Gaussian kernel, but we focus on the box kernel 

K h {x,x') = \{\\x-x'\\ OQ <h}. (3.4) 



3.2. Yaroslavsky's filter (YF). YF was introduced by Yaroslavsky [62] and independently 
by Lee [28], and more modern variants such as SUSAN [48] and Bilateral filtering [55]. Here, 
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similarity between pixels is based on their spatial distance and on the relative proximity of 
image intensity at these pixels. This translates into choosing weights in (3.1) of the form 

= K h {xj,Xj) L hy (yi,yj) , (3.5) 

where K, L are kernels and h, h y the associated bandwidths. (K, h) control the spatial prox- 
imity while {L,hy) control the photometric proximity. As in classical kernel smoothing, h 
plays the role of spatial bandwidth, while h y is a photometric bandwidth. In this work we 
only consider the simple version using the box kernel: 

K hy (y,y') = l{\y-y'\<h y }. (3.6) 

3.3. Non-Local Means (NLM) and patch-based methods. NLM and other patch-based 
methods generalize the idea of including the photometric proximity in the kernel. In [6], the 
distance between two pixels is solely measured in terms of the discrepancy between patches 
surrounding the pixels considered. Though spatial proximity was already introduced in [6], 
it was only mentioned as a numerical parameter to solve a computational issue. However, 
later works (c/. [44, 64]) have shown that spatial proximity can improve NLM performance. 
We consider NLM with spatial proximity, which includes the non-local version, the two being 
identical when h is sufficiently large. 

A generic description is the following. Let hp > and let (leaving hp implicit) be the 
hypercube of width hp centered at x\, i.e., 



P \ — %i H~ 



hp 


hp 






T 





\X - Si I loo < Y } 



Such a patch corresponds to a pixel patch of width [hpn] + 1 in the digital image (where [a] 
denotes the largest integer not exceeding a G R). Let yp. = (yj : xj G P^) be the vector of 
pixel values over the patch centered at X{. With this notation, the weights used in NLM are: 



h3 



K h (x hXj ) L hy (y Pi ,y P .) , (3. 



where K, L are kernel functions and h, h y are bandwidths, as before. One classical choice of 
Lh y (which we consider in our theoretical results) is 

L hy (y Pi ,y Pj ) = i{||y Pi -yp,l| 2 < h y }. (3.9) 

The photometric similarity is based on the Euclidean distance between the patches (as vectors) 
around the pixels. We refer to this as "classical" or Euclidean NLM (or just NLM). 

Computing L^ y can be computationally intensive for large hp. To address this, some 
authors have considered projecting yp. onto a low-dimensional subspace and using this pro- 
jection to compute an approximation of Lf ly (yp v yp j ). This introduces an interesting trade-off 
between computational complexity and accuracy which is examined in [4, 54]. In this paper, 
we consider a 1-dimensional projection introduced in [31] where patches are simply compared 
via their means alone, resulting in a photometric kernel of the form 

L hy (yp^yp,-) = L hy (y Pi ,y Pj ) , y Pt := Ave(y P .). (3.10) 
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We refer to this method as NLM-average. For our theoretical results, we consider the kernel 



In our analysis, Euclidean NLM (3.9) and NLM-average (3.11) behave similarly, except 
for the regular pattern model, where the former is generally superior. In practice, however, we 
note a difference. In smooth regions, the average in (3.11) has little bias and little variance, 
making it significantly more robust to noise than the Euclidean distance (3.9). Near edges 
or patterns, however, the bias of the average in (3.11) can outweigh the variance, making 
Euclidean NLM (3.9) superior. This insight is supported by our experimental results in 
Section 6. 

The spatial bandwidth h is typically larger than the patch width hp. Common sizes used 
in practice are 21x21 kernel windows (also referred to as the searching zone) and 7x7 patches 
(in pixel units). Common kernels are the box- kernel for K and the Gaussian kernel for L. 
Though we assume box kernels for both, our results extend readily to other kernel functions. 

4. Oracle inequalities and minimax results for cartoon images. We analyze the perfor- 
mance of the kernel-based methods described in Section 3 within the mathematical framework 
detailed in Section 2. Qualitatively speaking, our theoretical results are congruent with what 
is observed in practice; see our experiments in Section 6. 

Indeed, we show that LF blurs edges, which is in fact well-known both in theory and 
practice. YF performs well when the JNR is large, and poorly otherwise. This filter relies 
on a clear gap between the pixel values on either side of the discontinuity: when the JNR is 
large, there is indeed a gap, which ceases to exist when the JNR is of order 1 (cf. Figure 4.1). 
The latter situation is where NLM shines. Indeed, patches of size larger than one pixel gather 
more information about the area surrounding the pixel, which NLM (implicitly) uses to assess 
whether two pixels are on the same size of the discontinuity. For example, comparing patches 
in Figure 4.2, we see that the means of sufficiently large patches allow us to estimate reliably 
whether each center pixel is in or not, even with an JNR of order 1. 

In what follows, we focus on the LPR variants described in (3.2) to avoid a systematic bias 
that conventional weighted average variants suffer from. It is well-known that this bias appears 
near the boundary of the image, though this can be corrected with a proper extension of the 
image. More importantly, this bias arises also near the discontinuity. Note that enforcing the 
spatial windows to have the same (symmetric) shape puts a real constraint on the resulting 
performance of the algorithm, as discussed in Section 4.3. The choice of kernel K for LPR 
variants (3.2) is unimportant for standard kernel regression as long as it satisfies some basic 
properties. (For example, in [17, Th. 3.1], the kernel does not impact the error rate except for 
a multiplicative constant.) Less is known about the impact of the choice of L. In this paper, 
we consider box kernels for both spatial and photometric components, namely (3.4), (3.9) or 



To obtain our bounds, we minimize the error with respect to the bandwidth parameter 
h, effectively striking a good balance between the bias and variance in (2.3). Indeed, the 
larger the bandwidth, the larger the bias and the smaller the variance. The issue with kernel 
smoothing — whether in the form of weighted average (3.1) or LPR (3.2) — is that it suffers 
from a substantial bias when the smoothing window (those points where the weights are equal 




(3.11) 



(3.11). 
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Figure 4.1. On the left column are cartoon images with increasing levels of noise (rows are with JNR — 
4, 2, 1 from top to bottom). A searching zone is displayed in red, for a pixel near the discontinuity. The middle 
column is a close-up of the searching zone, while the right one provides histograms of pixel values within it. 

to one) includes points from the "other side" of the discontinuity. At the same time, the 
window cannot be too small, for otherwise the variance will be overwhelming. 

4.1. Linear kernel smoothing blurs edges. It is well-known that LF blurs discontinuities. 
This comes from the fact that the window size is fixed — the same at all pixels — so points 
near and points far from the discontinuity are treated in the same way. This lack of adaptivity 
leads to a substantial MSE. How does that statement translate into a mathematical result 
within our framework? The following is proved in [2] for d < 2, though the result (at least 
the upper bound) is probably older; see also [27]. We provide a proof for LPR in Section 8. 

Theorem 4.1. Let^ F denote the linear estimator, in the form of either local average (3.1) 
or LPR (3.2) , with weights as in (3.3). We have 

inf^ n (f^ F ,^ cartoon (a,C )) -n LF := (a 2 /n d ) 1/(d+1) , 

h 

and the optimal choice of bandwidth is h x h LF := (a 2 /n d ) 1 ^ d+1 \ 

Note that the bound does not depend on the regularity a > 1 of the function /. As 
apparent in the proof, this is because LF blurs edges: to strike a good bias-variance trade- 
off, the smoothing window cannot be too small, transforming sharp edges into ramps. The 
resulting bias is then larger than the bias over the smooth regions, which is where a appears. 
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Figure 4.2. We use again the image from Figure J^.l, last row (JNR=1). The noisy image is displayed in 
the first column, with kernel supports. The second column is the result of the local (box) kernel averaging using 
support of width 1, 3 and 7 from top to bottom. The last column provides histograms of the filtered pixels 

4.2. Oracle kernel. What can we hope to achieve with adaptive kernel methods? Statis- 
ticians have used the notion of an oracle to answer this question [13, 21, 57]. We saw that 
what limits linear filtering is a large bias near the discontinuity, due to the mixing of pixels 
from both sides. What if we had access to an oracle that would identify for us the foreground 
and the background? 

The membership oracle tells us which sample points belong to or to fi c . With access 
to this oracle, we simply process the smooth pieces, fa and fac, separately. By doing so 
we achieve the minimax rate for the class %d{ a i Co) : the information this oracle provides is 
sufficient to do as well as if there were no discontinuity. This is illustrated in Figure 4.3 (best 
viewed in color). 

Theorem 4.2. Let denote LPR estimator (3.2) with weights as in (3.3) when xi and 
xj belong to the same side of the discontinuity, and set to zero otherwise. We have 

inf7e n (ff o ,^ cartoon (a,C )) x TZ MO := (a 2 /n d ) 2a ^ d+2a \ 

h 

and the optimal choice of bandwidth is h x h MO := (cr 2 //^) 1 /^ 20 ^ . 

The lower bound is a well-known minimax bound [27, Theorem 5.1.2, p. 133]. If we 
consider a class of piecewise polynomial functions, then this oracle estimator, without spatial 
proximity (i.e., h = oc), achieves the parametric rate of a 2 /n d . It is worth noting that 
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(a) Kernel smoothing (b) Bandwidth oracle (c) Membership oracle 



Figure 4.3. The kernel supports for the linear filter, the bandwidth oracle and the membership oracle 



A 




Figure 4.4. Membership oracles of order and 1 on a non-noisy ID signal. Note how the bias is reduced 
by going to the order 1. 

LPR plays a crucial role here. Indeed, the window around a point near the discontinuity — 
comprised of all points belonging to the same side of the discontinuity — will be irregularly 
shaped. For instance, imagine a point on a linear surface adjacent to the discontinuity. For 
a symmetric window (sufficiently small not to include the discontinuity) , the linear variations 
around the pixel of interest will average out and we can accurately estimate the pixel value. 
For an asymmetric window caused by the discontinuity, the linear variations will not average 
out, inducing a small bias and leading to a higher risk of order (cr 2 /n d ) 3 /^ +3 ^ when a > 3/2. 
This phenomenon can be observed in practice and is illustrated in Figure 4.4. 

Note that the oracle only has to provide the membership information locally, within the 
searching window. The insight we get from this is that we only need to know over which 
pixel values to average to attain the same error rate as we would without discontinuities. 
This is exactly what adaptive kernel methods [29], including patch-based methods, PDE 
methods [39, 1, 19] and graph diffusion methods [52, 47] aim at doing. 

4.3. Variable bandwidth kernel methods. These methods [37, 50, 22, 23, 24], including 
Lepski's method [29] and variants [40, 41], choose the bandwidth adaptively at every location, 
the goal being to avoid smoothing over discontinuities and to adapt to the regularity of the 
signal when unknown. Wavelet shrinkage methods are often thought to perform some sort of 



Minimax bounds for non-local means 



13 



variable-bandwidth kernel smoothing [13]. Clearly, we cannot do better than if we knew the 
discontinuity, meaning if we had access to the membership oracle. In that case, at each point 
we would choose the bandwidth equal to its distance to the discontinuity (BO below stands 
for bandwidth oracle). See Figure 4.3 for a comparison of the MO and BO spatial supports. 

Theorem 4.3. Let denote LPR estimator (3.2) with weights chosen as in (3.3) when 
— x j\\oo < dist(x, 9f2) := inf^ \\x — y\\oc, and set to zero otherwise. We have 

inf ^ n (f^°,^ cartoon (a,Co)) x K BO := (A n a 2 /n) V (a 2 /n d ) 2a ^ d+2a \ 

h 

where A n — logn when d — 1 and A n = 1 when d > 2, for an optimal choice of maximal 
bandwidth h x h MO . 

Note that BO achieves the error rate of MO only when d = 1, when d — 2 and a = 1, or 
when d > 2 and a 2 = 0(n -2a ( 1-1 /^ +1 ), which is polynomially small when d — 2 and a > 1 or 
when d > 3. Thus in general, BO is substantially weaker than MO. That said, BO achieves 
the minimax rate established in [56] when a is fixed. 

4.4. Yaroslavsky's filter is oracle-optimal under low noise. As the practitioner knows, 
YF can be quite good on natural images. In fact, it can dramatically outperforms the linear 
filter and compares favorably with methods such as wavelet thresholding, particularly when 
the noise level is small. We substantiate this empirical evidence with a theoretical study of 
its performance, showing it achieves MO bound in such situations (i.e., when a is small). 

Assume that for a fixed cumulative distribution function F, the noise satisfies the following 

P(M <t)> F(t/a), Vt, Vi € 4, (4.1) 

The following result states that YF achieves a performance comparable to that of MO if 
a is small. We only require that the noise distribution in (4.1) has quickly decaying tails. 

Theorem 4.4. Let denote the LPR estimator (3.2) with Yaroslavsky's weights (3.5). 

Suppose that, for some constants C,b > 0, (4.1) holds with 1 - F(t) < C exp(-(t/C) b ) for t 
large enough. Then there is another constant C > such that, if a < (C'logn) -1 / 6 , 




(a,C ))<(l + o(l))n ! 



where an optimal choice of bandwidths is fix h MO and h y x 1 . 

Gaussian noise satisfies the requirements of Theorem 4.4 with C = \[2 and 6 = 2, resulting 
in the constraint a = 0(1/ ^/logn), which is quite mild. This explains why YF tends to perform 
well in practice, at least for low noise level. 

This excellent performance hinges on the fact that the photometric kernel is able to mimic 
the membership oracle when the noise level is small. When the noise level is of order 1 or 
larger, this is no longer true, as illustrated in Figure 4.1. There, we clearly see that in a 
window containing points from both ft and its complement, the pixel values are mixed in the 
histogram if the noise level is too large, making a clear separation impossible. We formally 
argue this point after the proof of Theorem 4.4 in Section 8.2.4. 
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It is worth noting that the proof helps clarify exactly the artifacts encountered in practice 
by the YF for strong noise (cf. Figure 6.5). Indeed, the output often looks like the original 
scene contaminated by something like "salt and pepper" noise. As mentioned in the proof, 
this is because the YF does not alter pixels with extreme values. 

4.5. Performance analysis for Non-Local Means. In the previous section we established 
that YF performs as well as MO when the noise level is small, while it is useless otherwise. A 
natural strategy consists of, first, reducing the noise level by averaging and, then, applying YF. 
This is almost exactly what NLM-average does. We precisely quantify the MSE performance 
of both NLM-average and Euclidean NLM in this section. Note that we state our results for 
i.i.d. Gaussian noise for simplicity, though they are valid for many other distribution families 
such as uniform and double-exponential. 

Theorem 4.5. Let f^J^ 1 denote LPR estimator (3.2) with NLM weights (3.8) and photo- 
metric kernel either Euclidean (3.9) or Average (3.11). If the noise conditions of Theorem 4-4 
hold, then 

inf n n (i^, .F cartoo >, Co)) < (1 + o(l))1Z MO , 

h,h y y 

where hp — 1/n. Otherwise, assuming a is bounded away from 0, we have 

inf ^ n (f^ M ,^ cartoon (a,C )) r< ^ NLM := (B n /n) V (a 2 /n d ) 2a ^ d+2a \ 

h,h y ' y 

where B n := (a 4 logn) 1 /^ (Euclidean) or := (a 2 logn) 1 /^ (Average), and an optimal choice of 
bandwidths is h x h MO , h y x hy LM := a 3 y / logn (Euclidean) or := (2Cb)~ d /V 2 (Average), 
and hp x /ip LM := B n /n. In other words, if the low-noise conditions of Theorem 4.4 hold, 

then the optimal patch size is a single pixel, and the NLM is exactly YF and we achieve the 
YF performance bound. There is an elbow in the performance bound, since once the optimal 
patch size exceeds a single pixel, estimation errors within a patch sidelength of the boundary 
impact the performance. 

There is a strong correspondence between this bound and the BO bound in Theorem 4.3. 
If a is fixed, ^ NLM x K BO for d = 1 and ^ NLM x (logn) 1 /^ 60 for d > 2, therefore K BO is 
the minimax rate [56] and NLM is minimax optimal up to a logarithmic factor. 

Note that our bandwidth h is not infinite as in [34]. There, the authors use an infinite 
window for searching for matching patches: this is optimal in their setting because they con- 
sider piecewise constant images. In our setting, images are piecewise smooth, and a smaller 
bandwidth can not only help us reduce the risk of our estimate, but also lead to more com- 
putationally efficient estimation algorithms. 

5. Performance analysis for thin features and textures. In the cartoon model of Section 2 
with JNR of order 1, the performance of NLM is comparable to that of variable bandwidth 
kernel smoothing, and actually that of wavelets as well [26, 42]. In natural images, however, 
NLM can perform substantially better. We explain this by the fact that the cartoon model 
we considered so far, though useful as a benchmark, does not account for features common in 
natural images, particularly, very thin regions a few pixels wide and regular textures. 
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Below, we do as if the image contained regions of cartoon type and regions with thin 
features and/or texture, and keep the same bandwidths that we found to be optimal in the 
cartoon model in the previous results. 

5.1. Thin features. Both YF and NLM achieve a good performance on thin features. We 
focus on sample points within the feature and focus on the interesting case where the thickness 
is of smaller order of magnitude than the bandwidth h. 

Theorem 5.1. Consider f E J^ hin (a, Co, do? with band ft; assume all parameters are fixed 
except a > 4/n and a — >> as n —> oc . In terms of point-wise risk (2.3) at x\ E Vt, we have: 

1. The linear filter with bandwidth h LF has risk of order 1 if a — o(h LF ). 

2. BO with maximal bandwidth h MO has a point-wise risk of order a 2a V a 2 (na)~ d , if 
dist(a^,fi c ) > a/C for some C > 3, if na — )> oc. 

3. MO with bandwidth h MO has risk of order (h MO / a) d ~^n MO if a = o(h MO ). 

4- The latter is true of YF with bandwidths /i MO , ^xl ? if the noise satisfies the condi- 
tions of Theorem 4-4- 

5. This is also the case of NLM (Euclidean or Average) with bandwidths h = /i MO , h y = 
h^ LM , and patch size h P = h^ LM , if dist(x u fl c ) > h^ LM . 

In view of this result, we can say that linear filtering essentially erases the feature. In 
contrast, YF still performs very well (relative to the MO) under low noise, and NLM performs 
well in this case and for higher noise settings. Note that when hp LM — o(a), the bound above 
holds for most points within the thin features. Though not stated here, we found that NLM 
is able to handle such bands under special circumstances — when d > 3 and the band is 
straight. 

5.2. Regular patterns and textures. We consider very general patterns where YF will do 
as well as in the cartoon model, situations where most other methods are essentially useless. 
Euclidean NLM performs well too, under additional assumptions on the pattern. 

Proposition 5.1. Consider f E ^^^(a, Co, a) with all parameters are fixed except for 
a, which satisfies a = o(h MO ). Let Nq := #{z : xi E £1} and Nqc is defined similarly. If 
Nq V Nqc < C(Nq A Nqc) and na > (r + 1)(2C + 2), with C > 1 fixed, we have the following: 

1. MO with h = h MO achieves an MSE of order TZ MO . 

2. The latter is true of YF with bandwidths /i MO , h y x 1 ; if the noise satisfies the condi- 
tions of Theorem 4-4- 

3. Suppose in addition that for every X{ E Vt and xj E Vt c , 

||i(P i nfi)-i(P 7 -nfi)||i > (<7 2 logn)/C, (5.1) 

for some C > 1 fixed. Then (Euclidean) NLM with bandwidths h = /i MO , h y = h^ LM 
and patch size /ip LM ; achieves an MSE of order (na) d TZ MO . 

The condition (5.1) essentially means that any two patches, where on is centered in the 
foreground and the other is centered in the background, must be sufficiently distinct - and the 
necessary degree of distinction increases with the noise level. For instance, (5.1) is satisfied 
by such patterns as a chessboard or stripes. A regular pattern in a real image (e.g., Barbara's 
blouse) is often referred to as texture, and NLM is able to effectively denoise such patterns 
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a = 5 


a = 20 


cr = 50 


cr = 100 


r = 
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13 


23 


35 


r = 


1 


9 


17 


25 


33 


r = 


2 


23 


41 


59 


61 



Table 6.1 



Spatial bandwidth h used in practice obtained by minimizing the MSE of the MO on Bowl (cf. Figure 6.1). 

under some regularity conditions. For random models of textures, such as Markov random 
fields, we do not expect NLM to do well unless the pattern is not very random. The reason is 
that few patches are close in Euclidean distance to a given patch. 

6. Experiments. In this section we provide numerical results for images with d = 2, 
whose pixel intensities are between and 255. In our experiments, the noise is Gaussian with 
standard deviation a E {5, 20, 50, 100} (Note that this corresponds, for normalized images in 
[0, 1] to noise with a E {5/255, 10/255, 50/255, 100/255}). 

On both toy and classical images, we have compared the behavior of the following methods: 
linear filtering (LF), Yaroslavsky's filter (YF), Euclidean NLM (NLM), average non-local 
means (NLM-average) and the membership oracle (MO). In all cases we have implemented 
LPR version of the methods for the orders r E {0,1,2}. Note that, as expected, for linear 
filtering LPR of order and 1 are exactly identical because the support of the kernel is 
symmetric. However, for other methods the symmetry of the support is no longer guaranteed 
and the estimators differ. The higher order LPR versions are computed by solving the linear 
system in (8.5). A small numerical constant (10 -8 ) is added to the diagonal elements of X T X 
so that inverting this matrix is always a well conditioned problem. 

For fair comparisons we have used the same box kernel with every method. The patch 
size is 7 x 7 (i.e., hp = 7). It is kept fixed for all the methods. For the spatial bandwidth h 
we have chosen to use the values obtained by considering the best h (in term of MSE) for the 
MO, on the Bowl image. Thus, we use for each noise level a E {5, 20, 50, 100} and polynomial 
order r E {0, 1, 2} an h optimized on Bowl. The values are provided by the MSE optimization 
in Figure 6.1 and summarized in Tab. 6.1. The photometric bandwidth h y is chosen by hand, 
and differs from method to method: VlOa (YF), 0.29a (NLM-average), 13.1a (NLM), 30 
(MO). It is to be noted that the parameters are given for comparison in between methods, we 
do not claim those are the best parameters for all applications. 

In practice, h needs to increase with r to ensure that the LPR is stable. Note that there 
are q = ( r ~^ d ) polynomial coefficients in each search window. If we apply the rule of thumb of 
10 observations per unknown parameter, the search window needs to include about lOq pixels. 
This is illustrated in Figure 6.1, where we see the best h increasing with r. 

Since for natural (not cartoon-like) images MO is irrelevant, we have used a modified YF 
oracle instead. This oracle has access to the original image to compute the weights as in (3.3), 
and then performs LPR on the noisy pixel values with these weights. For piecewise constant 
images, this coincides exactly with MO as soon as the bandwidth is large enough. 

The experiments conducted show that LF is always outperformed in practice by YF, 
NLM and NLM-average. For low noise level (a = 5), the YF with r = 2 outperforms the 
other methods (cf. Figure 6.6 and Table 6.2). However, in the presence of strong noise the 
NLM and the NLM-average are the clear winners on most images. Interestingly, and may be 
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Figure 6.1. MSE with respect to h for the image Bowl, with noise level a £ {5, 20, 50, 100}. 



surprisingly the NLM-average can even improve on the NLM for very strong noise, even for 
natural images. On the other hand, one can see that the NLM-average mimics the behavior 
of LF for textured images with strong noise (cf. Figure 6.9), due to the fact that different 
sides of periodic features are averaged together. This limitation is particularly obvious for the 
Stripe image 6.9. 

The influence of the degree r of the LPR depends on the nature of the image being denoised 
(e.g., natural vs. cartoon images). In practice it remains unclear how this parameter should 
be tuned. Figure 6.2, Figure 6.3, Figure 6.4, and Figure 6.5 demonstrate the importance of 
the jump parameter ji in practice. On the left end of the Swoosh, the jump is larger and we 
reconstruct it accurately across all noise levels. On the right end, the jump is smaller and the 
performance degrades with a, exactly as predicted by our theory. 

The MATLAB codes are available on the authors' webpages to reproduce those results. 

7. Discussion. The theoretical results of the preceding sections are summarized in Ta- 
ble 7.1. 

As described in the Introduction, the bounds described in this paper and in the indepen- 
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01 


918 


08 


646 


76 


MOO 


14 


19 


15 


12 


22 


74 


80 


33 


19 


90 


61 


09 


54 


41 


MOl 


18 


31 


17 


96 


23 


06 


91 


72 


22 


38 


76 


21 


65 


36 


M02 


17 


76 


17 


65 


18 


95 


88 


60 


22 


09 


72 


00 


62 


53 



Table 6.2 



MSE comparisons of the denoising methods considered for LPR of order 0, 1 and 2. The compared methods 
are the Linear Filter (LF), the Yaroslavsky Filter (YF), the NLM-average (NLM-Av.), the classical NLM and 
the Membership Oracle (MO). Results are averaged over 5 Gaussian noise replicas. 
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Noisy, MSE (b) LFO, MSE (c) LF1, MSE (d) LF2, MSE 

2.50e+01 4.03e+01 5.55e+01 8.29e+01 




(e) YFO, MSE = (f) YF1, MSE = (g) YF2, 
1.69e+00 1.31e+00 1.19e+00 



MSE 



(h) NLM-Av.O, 
2.56e+00 



MSE 




(i) NLM-Av.l, 
1.90e+00 



MSE 



(j) NLM-Av.2, 
1.75e+00 



MSE 



(k) NLMO, 
1.54e+00 



MSE 



(1) NLM1, 
1.36e+00 



MSE 




(m) NLM2, 
1.37e+00 



MSE 



(n) MOO, 
1.79e+00 



MSE 



(o) MOl, 
1.26e+00 



MSE = (p) M02, 

1.31e+00 



MSE 



Figure 6.2. Toy thin feature image (Swoosh) corrupted Gaussian noise with a = 5. 
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(a) Noisy, MSE 
3.99e+02 



(b) LFO, MSE 
9.18e+01 



(c) LF1, MSE 
1.35e+02 



(d) LF2, MSE 
2.09e+02 




(e) YFO, MSE 
1.71e+01 



(f) YF1, MSE 
2.08e+01 



(g) YF2, MSE 
2.49e+01 



(h) NLM-Av.O, MSE 
9.86e+00 




(i) NLM-Av.l, MSE 
8.37e+00 



(j) NLM-Av.2, MSE (k) NLMO, MSE 
7.87e+00 6.19e+00 



(1) NLM1, MSE 
6.00e+00 




(m) NLM2, MSE 
4.53e+00 



(n) MOO, MSE 
4.54e+00 



(o) MOl, MSE 
3.39e+00 



= (p) M02, MSE 
2.85e+00 



Figure 6.3. Toy thin feature image (Swoosh) corrupted Gaussian noise with a — 20. 
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(a) Noisy, MSE (b) LFO, MSE (c) LF1, MSE (d) LF2, MSE 

2.50e+03 2.13e+02 2.35e+02 2.91e+02 




(e) YFO, MSE = (f) YF1, MSE (g) YF2, MSE (h) NLM-Av.O, MSE 

1.38e+02 1.56e+02 1.78e+02 3.11e+01 




(i) NLM-Av.l, MSE (j) NLM-Av.2, MSE (k) NLMO, MSE (1) NLM1, MSE 

2.91e+01 2.64e+01 3.54e+01 4.25e+01 




(m) NLM2, MSE = (n) MOO, MSE (o) MOl, MSE (p) M02, MSE 

4.35e+01 8.61e+00 8.32e+00 6.40e+00 



Figure 6.4. Toy thin feature image (Swoosh) corrupted Gaussian noise with a — 50. 
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(m) NLM2, MSE = (n) MOO, MSE (o) MOl, MSE (p) M02, MSE 

1.68e+02 1.51e+01 1.82e+01 1.86e+01 

Figure 6.5. Toy thin feature image (Swoosh) corrupted Gaussian noise with a — 100. 
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o o o o 

(a) Noisy, MSE (b) LFO, MSE (c) LF1, MSE (d) LF2, MSE 

2.50e+01 5.76e+01 7.40e+01 1.05e+02 



o o o o 

(e) YFO, MSE = (f) YF1, MSE = 9.88e-01 (g) YF2, MSE = 8.74e-01 (h) NLM-Av.O, MSE 
1.71e+00 3.41e+00 

o olio o 



(i) NLM-Av.l, MSE (j) NLM-Av.2, MSE (k) NLMO, MSE (1) NLM1, MSE 

2.50e+00 2.15e+00 1.67e+00 1.60e+00 



o o o o 

(m) NLM2, MSE = (n) MOO, MSE = 9.56e- (o) MOl, MSE = 5.01e- (p) M02, MSE = 3.48e- 
1.50e+00 01 01 01 

Figure 6.6. Toy cartoon image (Bowl) corrupted Gaussian noise with a = 5. 
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(a) Noisy, MSE 
3.99e+02 




(e) YFO, MSE 
1.53e+01 





(b) LFO, MSE 
7.78e+01 



(f) YF1, MSE 
1.80e+01 




(c) LF1, MSE 
1.04e+02 




(g) YF2, MSE 
1.90e+01 




(i) NLM-Av.l, MSE (j) NLM-Av.2, MSE 
6.62e+00 5.10e+00 



(k) NLMO, MSE 
5.86e+00 






(d) LF2, MSE 
1.40e+02 





(h) NLM-Av.O, MSE 
7.55e+00 





(1) NLM1, MSE 
5.83e+00 





(m) NLM2, MSE 
4.92e+00 



(n) MOO, MSE 
4.12e+00 



(o) MOl, MSE 
3.03e+00 



= (p) M02, MSE 
2.37e+00 



Figure 6.7. Toy cartoon image (Blob) corrupted Gaussian noise with a — 20. 
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(a) Noisy, MSE (b) LFO, MSE 



2.50e+03 



9.00e+02 



(c) LF1, MSE 
9.55e+02 



(d) LF2, MSE 
1.15e+03 




(e) YFO, MSE 
6.53e+02 



(f) YF1, MSE 
6.99e+02 



(g) YF2, MSE 
8.10e+02 



(h) NLM-Av.O, MSE 
4.20e+02 




4.17e+02 



(j) NLM-Av.2, MSE 
4.26e+02 



(k) NLMO, MSE 
4.50e+02 



(1) NLM1, MSE 
4.76e+02 




(m) NLM2, MSE 
4.96e+02 



(n) MOO, MSE 
5.28e+01 



(o) MOl, MSE 
5.86e+01 



= (p) M02, MSE 
5.18e+01 



Figure 6.8. Barbara image corrupted Gaussian noise with a = 50. 
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(a) Noisy, MSE 
9.98e+03 




(e) YFO, MSE 
8.91e+03 




(i) NLM-Av.l, MSE 
7.51e+03 




(f) YF1, MSE 
9.26e+03 




(j) NLM-Av.2, MSE 
7.25e+03 




(c) LF1, MSE 
1.78e+04 




(g) YF2, MSE 
8.87e+03 




(k) NLMO, MSE 
2.42e+02 




(d) LF2, MSE 
1.71e+04 




(h) NLM-Av.O, MSE 
7.39e+03 




(1) NLM1, MSE 
2.10e+02 



(m) NLM2, MSE = (n) MOO, MSE (o) MOl, MSE = (p) M02, MSE 

2.33e+02 2.11e+01 2.37e+01 2.28e+01 

Figure 6.9. Toy texture image (Stripes) corrupted Gaussian noise with a — 100. 
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Image class 


Method 


Bound 


^•cartoon 


MO 

LF 

YF 


K n X Tl MO := ( a y n dy2a/(d+2a) 
Tin ~ (<T 2 /™ d ) V(d+1) 

Tin < (1 + o(l))7e MO (for low noise) 




NLM 


Tl n < < 


(1 + o(l))ft MO for low noise 
[{a 4 log n) l / d /n] V ft MO otherwise 




NLM-Av. 


Tl n 1 < 


(1 + o(l))Tl MO for low noise 
^ [(a 2 log n) 1 ^/^ V Tl MO otherwise 




MO 


Tl n ~ ^ MU 


^pattern 


LF 
YF 


Tln-1 

Tin =< ft MO (for low noise) 




NLM 




(1 + o(l))Tl MO for low noise 
[(na) d Tl MO for "distinct" patterns 




MO 


E(/i- 


fi) 2 x (/i MO /a) d -*7e MO 


^•thin 


LF 
YF 


E(£- 
E(£" 


h? ^ 1 

/i) 2 r< (h MO /a) d - do Tl MO (for low noise) 




NLM 


m- 


/i) 2 r< (/i MO /a) d - d o7e MO 




NLM-Av. 


m- 


/i) 2 r< (/i MO /a) d -*7e MO 



Table 7.1 

Summary of results 



dent work [34, 33] address fundamental performance limits of NLM and related photometric 
image filtering methods. These methods have an established history of strong empirical perfor- 
mance on natural images, but until now little was known about how these methods performed 
asymptotically, especially with respect to related methods based on computational harmonic 
analysis (e.g., wavelet or curvelet denoising). 

Both our bounds and the bounds in [34] suggest that NLM has some limitations for piece- 
wise smooth images when the noise is not small (i.e., when YF cannot perform effectively). 
When the noise is small, we note that YF is a special case of NLM with a patch size of one 
pixel, and the performance of NLM hinges upon our ability to measure the similarity of two 
patches based on noisy observations. In low noise, this similarity can already be estimated 
quite accurately with a single pixel patch. In stronger noise, the similarity measured through 
larger patches is more robust to noise — but larger patches also introduce some bias. This 
results in an elbow in the performance bounds for NLM. Recent empirical results suggest that 
these limitations can be mitigated by adapting the kernel shape [53], the patch shape [12], or 
spatial bandwidth h [26]; a theoretical understanding of this kind of adaptation remains an 
open problem. 

There are several distinctions between our work and the closely related work in [34] that 
bear mentioning. First, we consider the cartoon model, where the functions are piecewise 
Holder smooth images with a discontinuity set corresponding to a Lipschitz mapping of the 
unit ball, while Maleki et al. [34] consider the horizon edge model, where the functions 
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are piecewise constant with a discontinuity set corresponding to the graph of a Lipschitz 
function. Though they actually consider smoother edges, their analysis reduces to the case 
of Lipschitz smoothness. We consider images in arbitrary dimension — showing that NLM 
behaves differently when d > 3 — while Maleki et al. consider the case of 2D images (d = 2). 
Because they consider functions that are piecewise constant, they use the weighted average 
version (3.1) without spatial localization (i.e., h = oc). Because we focus on smooth — not 
necessarily constant — regions, we need to localize both YF and NLM. Applying the more 
complex LPR (3.2) enables YF to adapt to the degree of regularity in each smooth region. 
We note that Maleki et al. [34] do not consider the case of low noise and simply show that 
YF achieves the same performance as LF — which is also our conclusion in strong noise. 

Moreover to simplify the analysis, Maleki et al. consider a slightly modified version of 
NLM and derive lower bounds for oracle versions of YF and NLM. The lower bounds for 
NLM were also challenging for us and we only provide heuristics. We mention that our results 
imply that the simpler NLM-average achieves the same performance as NLM in the horizon 
model. 

Let us also comment on the work of Tsybakov [56]. This paper suggests and analyzes 
(within the cartoon model) a method very similar to NLM-average, based on medians rather 
than means. The method is based on non-overlapping patches. This allows the method to be 
applicable in situations where the noise distribution is heavy-tailed. (We were not aware of 
this work when we prepared our paper.) 

Our analysis of NLM for image classes with thin features or regular patterns is also a 
significant novel aspect of our work. Though we expect wavelets are near-optimal for cartoon 
images when the discontinuity is Lipschitz, NLM has a significant empirical advantage over 
wavelets for certain kinds of repeating textures. We develop a model for images with these 
features, and note that it does not approach cartoon or horizon model asymptotically. For 
this image class, we demonstrate that NLM performs as well as it does for the cartoon class. 

The current bounds are based on ideal bandwidths which depend on the unknown smooth- 
ness parameter a. Thus we have demonstrated that the adaptive filtering techniques consid- 
ered adapt to the discontinuity fi, but not to a. We anticipate that adapt ivity to a is indeed 
possible and leave that analysis for future work. 

We note that NLM is not the current state-of-the-art image denoising method in common 
use. More evolved patch-based methods utilize sparse representations of patches, adaptive 
kernel bandwidths, and adaptive patch shapes (cf. [9, 10, 32, 11, 49, 53]). While these aspects 
are not considered in our analysis, the theoretical insights provided by this paper may po- 
tentially lead to an improved understanding of a broad class of patch-based image denoising 
methods and subsequently better algorithms (cf. [46] for possible directions). 

8. Proofs. In this section, C, Ci,C2,... denote finite positive constants that do not 
change with n and whose actual value may change with each appearance. 

8.1. Preliminary results. We first gather some basic results. 

8.1.1. Some analysis. Functions in T-Ld(a,Co) are uniformly well-approximated locally 
by polynomials of degree |_cej , specifically their Taylor expansions. For g E !~ld( a ,Co) and 
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x E [0, l] d , the Taylor expansion of g at x of degree t E N is defined as follows: 

|s|<t i=l l ' 

Lemma 8.1. For any g E ^(a, Co), 

l#')-t J 9Ml <c a C ||x , -x||° , Vx,x'e [0,l] d , 

where 

c a := ^ 



Proof. Though this sort of result is well-known, we provide a proof for completeness. A 
Taylor approximation of degree [a\ gives: 

g(x') = T^g(x') + £ -*«(*)) II (X ^f r \ 

| S |=H <=i 

for some 2 on the segment joining x and x'. Hence, 

\g(x')-TWg(x')\ < c a \\x' - x\\ W max - 

Now apply (2.5) and the fact that ||z — x]]^ < \\x f — x]]^ to get 

\g^(z)-g^(x)\<C \\x'-x\\^, Vs e N d , \s\ = [a\. U 
8.1.2. Some geometry. For a measurable set A cR d , 

p(A):= inf inf sup ( ™^ V ' ^ : B(y, s) C B(x, h) P) a\ . (8.1) 
/ie(o,i)xeA [ Vol(5(x,/i)) v 7 J 

The quantity p(A) provides some measure of how irregular the boundary of A is. The following 
lemma bounds p from below for sets whose boundary is sufficiently regular. 

Lemma 8.2. Let (j) : R. d R. d be infective, with (j) and both C-Lipschitz. Then for 
Q = 0(5(0, 1)) and p defined in (8.1) ; we have min(p(f2), p(n c )) > (2C)~ d . 

Proof. Fix x E ft and h > 0. Since (j) is Lipschitz with constant C, we have 
0(5(0 _1 (x),/i/C)) C 5(x,/i). Note that z := _1 (x) E 5(0,1) and, by the triangle in- 
equality, B(z, h/C) H 5(0,1) D5(z',f), where z f := (1 - h/(2C))z and t := h/(2C). Because 

is C-Lipschitz, we have (j)~ l {B{(f){z'),t/C)) C B(z* ',£), so that 

5) C 0(5(^, t)) C 0(5(z, /i/C) n 5(0, 1)) C 5(x, /i) fl ft, 

where y := 0(z 7 ) and 5 := £/C. We obtain a lower bound for ft c in a similar way. ■ 

Next is a result on the number of sample points within a certain distance of a subset. Let 
X n d be the set of sample points, that is, X n d = {xi : i E I d }. 
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Lemma 8.3. For any subset A C (0, l) d of the form A = B(A > \rj) for some A! C (0, \) d 
and 4/n < r\ < 1, 

8"VVol(A) < \An X n d\ < 4VVol(A). 

Proof. Let z\,...,Zk E (0,1)^ be a maximal //-packing for A' {i.e., the balls B{zj 1 rj/2) 
for j = 1, . . . , k are disjoint and included in A' C A, and for any z E A', there is j such that 
z E B(zj,rj)). By the triangle inequality, we have 

(J B( Zj , V /2)cAc |J 5(^,2^). 

j=l,...,k j=l,...,k 

On the one hand, taking volumes on all sides, we get krf < Vol (A) < k2 d (2r}) d 1 since the unit 
(II • Hoc) ball has volume 2 d . This turns into 4:~ d Vol(A) < krf < Vol (A). On the other hand, 
counting sample points on all sides, using the fact that 

rfn d < \B(z,rf)nX n d\ < (2r)) d n d , Vz E (0, Vr/ E (2/n,l), 

we get 

fc fc 
k(r)/2) d n d < ^2\B(zj,ri/2) n X nd \ < \AnX nd \ < ^ \B(zj, 2rj) D X n a\ <k(4,r]) d n d . 
3=1 j=i 

Combining these, we get the desired result. ■ 

Lemma 8.4. Suppose 1 < do < d are integers and let (j) : R^ R. d be infective, with (j) and 
_1 (on the range of (j)) both C-Lipschitz with C > 1. Then there is another constant C > 1 
such that, for A := ^((0, a) d °) and h E (0, 1), 

l_ a do h d-d < Vol(S(A,/i)) < C f a do h d - do . 
Consequently, if (j) : M, d — >► R rf zs as a&ove and A := 0(55(0, 1)) ; t/ie result holds with d—do = 1. 

Proof. We first observe that, for any z E and /i > 0, since is C-Lipschitz, 

<f>(B(z,h)) C B(<f>(z),Ch). (8.2) 

Now, let zi,...,z m denote a maximal /i-packing of (0, a) d °. Note that m x (a/h) d ° when 
/i < 1. By definition ||^ — Zj|| > /i, so that ||0(^) — </>(zj)|| > h/C since </> _1 is C-Lipschitz 
with C > 1. Hence, 

□ B^(zi), h/C) c B(A, h/C) c B(A, h), 

i=l,...,k 

implying 

m 

Y,Vo\{B{<j>{ Zi ),h/C)) < Yo\(B(A,h)). 

i=i 
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We then conclude by the fact that YaLi Vol(B(0(^), h/C)) x mh d x a d °h d - d °. For the upper 
bound, we use the fact that (0, a) d ° C Ui=i 1 ... 1 kB(ziiti)i so that 

Ac |J <t>{B{zi,h)) <z |J B(</>(zi),Ch), 

i=l,...,k i=l,...,k 

by (8.2). Hence, using the triangle inequality, 

m 

Vol(S(A, /i)) < ^ Vol(B(0(^), C7i + ft)) x mft d x a dQ h d ~ d \ 
%=i 

For the second part, we use the fact that dB(0, 1) = U^((0, l) d_1 ) for a finite set of functions 
satisfying the requirements and the fact that the composition (j) o cj)£ is also Lipschitz. ■ 

8.1.3. Some statistics. We establish here some bounds on the point-wise MSE (2.3) of 
LPR (3.2). We mention that much finer results exist in dimension d = 1 for the case where 
the underlying function / is smooth; see [17] and references therein. 

Lemma 8.5 (Variance). 

For any sufficiently large constant C > 0, depending only on d, r, the following is true. 
Consider the LPR estimator of the form (3.2), with weights Uij E {0,1}. Assume that 
Bf 1 C Ai := {j : Uij = 1} C Bi^ := {j : xj E B(xi,h)}, for some discrete ball Bf 1 
satisfying |£>^ n | > \Bi^\/C for some constant C. 

^a 2 (nh)~ d < Var(^) < Ca 2 (nh)- d . (8.3) 

Proof We assume without loss of generality that X{ — and drop the subscript i for 
simplicity. Below C denotes a generic constant that may change with each appearance. Let 




which is the number of monomials in d variables of degree r or less. Let X denote the \A\ x q 
matrix with coefficients {x s - : j E A, \s\ < r). By definition of the local polynomial estimator 
(3.2) and the usual least squares formula, we have 

/ = e T (X T X)- 1 X T y, (8.5) 

where e = (1, 0, . . . , 0) E R q and y := (yj : j E A) (assuming that X is full-rank, which we 
prove further down). In particular, 

Var(/) = a 2 e T (X T X)- 1 e, 

since yj = f(xj) +£j, with the noise (ej) being uncorrelated and having identical variance a 2 . 

Let Zj = Xj/h and Z = [z s - : j E A, \s\ < r), and also let H = diag(/J s l, \s\ < r), so that 
X = ZH, leading to 

Var(/) = a 2 e T H- 1 (Z T Z)- 1 H- 1 e = a 2 e T (Z T Z)- 1 e, (8.6) 
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since H _1 e = e. This is because H is an invertible diagonal matrix with first element equal 
to 1. The reason we work with Z instead of X is that, under the conditions assumed here, 
Zj E [—1, l] d (because j E A) and (nh)~ d Z T Z is bounded from above and below in terms of its 
spectrum. Indeed, define matrices Zi = {z s - : j E 5 m , \s\ < r), Z2 = {z s - : j E A \ 5 m , \s\ < r), 
Z3 = (Z? : j E 5^, |s| < r) and Z4 = (z| : j <E Bh \ A, \s\ < r). Let -< denote the ordering for 
positive semi-definite matrices. Since 

Z^Zi -< Z^Zi + Z^Z2 = Z^Z = Z^Z3 — Z^Z4 -< Z^Z3, 

it suffices that we focus on proving a lower bound on the spectrum of Z^Zi and upper bound 
on the spectrum of Z^Z3. Consider therefore the case where A itself is a discrete ball, say 
A = { j : xj E 5(x,a/i)}, where a E (C -1 /^, 1) by assumption. Let z — x/h. First, assume 
that a and h remain fixed. Then for s,t E N d such that |s| V \t\ < r, we have 

\- 1 (Z T Z) st = (nh)- d V z s - +t -> M st := I u s+t d Ul when n/i -> oc, (8.7) 

^) 7^ J JB(z,a) 



V 7 J^A 



recognizing a Riemann sum on the LHS. So, if M = {M 8 t : |s| V |t| < r), we have the 
convergence (nh)~ d Z T Z M, when n/i — >► oc. M is a well-defined positive semi-definite 
matrix since its elements are bounded by 1 — because B(z,a) C 5(0,1) — so we only 
need to show that it is positive uniformly over a E (C -1 /^, 1). Let X z ^ a denote the smallest 
eigenvalue of M with integral over B(z,a), with z E 5(0,1) and a E (C -1 /^,!). We want 
to show that X z ^ a is bounded away from 0. Suppose this is not the case, that there are 
sequences (z m ,a m ) such that X Z m,am ~~ ^ as m —> oc. By compacity, we may assume that 
(zm,a>m) ~+ (^00,^00) E 5(0, 1) x [C _1/d , 1]. Then \ Zoo , aoo = 0, by continuity. Let be the 



associated matrix. Then there is E R g nonzero such that 

^ ( > \ boo yS u s \ 



o = b^ M 00 b 00 = / J2 b °°° b °°> tuS+tdu = / (J2 bc 



where the sums are over s E N rf such that |s| < r. This leads to a contradiction since the 
polynomial in the second integral cannot be zero on a nonempty ball. 

So far, we assumed that a and z were fixed. Assume this is not the case. The upper 
bound on the largest eigenvalue of Z T Z is bounded in the exact same way, using the fact that 



\zj\\ < 1. For the lower bound we still have that 



liminf (nh)- dS T z s - +t > inf / u s+ \du 

where the inf is over z' and a' such that a' E (C -1 /^, 1) and B{z r , a') C 5(0, 1). Our arguments 
apply to the RHS. We conclude that there is C\ E (0, oc) such that, for nh large enough, 

^r(nh) d < A min (Z T Z) < A max (Z T Z) < C^nhf. (8.8) 

We then redefine C as max(C, C\) and conclude with (8.6). ■ 
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Lemma 8.6 (Bias: Upper Bound). Assume that f E J 7caxtoon (ct,Co), with foreground ft ; and 
that the conditions of Lemma 8.5 also hold. If moreover either Ai C ft or Ai C ft c ; then, for 
some constant C > 0, the following inequality holds 

m-fi) 2 <mm(hCh 2a ). (8.9) 



Proof We continue with the notation introduced in the proof of Lemma 8.6. WLOG, 
assume A C ft. In that case, / is smooth in the window, since / = fa. By Lemma 8.1, 
f(xj) = T^f{xj) + g(xj), where Tq f is a polynomial of degree at most [<^J < ^ and 
\g{xj)\ < c a Coh a for all j E A. Now, for a polynomial p of degree at most r, let p = (p(xj) : 
j E A) so that p = Xa for some aER g , and we have 

e T (X T X)- 1 X T p = e T a = a = p(0). 

With this reproducing formula and the fact that T ^/(0) = /(0), 

E/ - /(0) = e T (X T X)- 1 X T g = e T (Z T Z)- 1 Z T g. 

Because of (8.8), we have 

\f(2?Z)-Ws\ < "^J'lrzj' 2 ^ ^K)- d HZ T gll2, (8-10) 

where C\ is the constant of Lemma 8.5. But the entries (Z T g) s = YljeA9(. x j) z j are uniformly 
bounded by \A\ • c a C /i a = 0((nh) d h a ), so that the RHS in (8.10) is of order 0(h a ). This 
imply that (E/ — /(0)) 2 < C2h 2a for some constant C2, and we conclude by redefining C as 
max(C,Ci,C 2 ). ■ 

Lemma 8.7 (Bias: Lower Bound). Let f = 1q, where ft = (0,1/2) x (0, l)^ -1 and linear 
filtering (meaning Uij = 1 if and only if, \\x{ — xj\\ < h/2). Then there is a constant C > 
such that, when dist(:z^, <9ft) < h/C, we have 

m - hf > i/c. (8.11) 

Proof. We continue with the notation introduced in the proof of Lemma 8.6. In particular, 
we translate everything so that X{ — 0. WLOG, assume that X{ E ft and let 5 = dist(x^, 9ft). 
Let Aq = {j : xj E D ft} and define Aqc similarly. Using the reproducing formula, we get 

E/ - f(xi) = e T (Z T Z)- 1 Z T l An - e T (Z T Z)- 1 Z T l = -e T (Z T Z)" 1 Z T l Anc . (8.12) 

Assume that S < h, in which case Aqc = {j : Xj E (5, h) x (— /i) d_1 }, in which case the RHS 
in (8.12) is equal to —G(5/h), where 

G(a) := e T (Z T Z)" 1 Z T l{ J - : ^. G ( a5l)x (_ 11) d-i}. 
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It suffices to show that there is C > such that G{a) > 1/C when a < 1/C. Assume this 
is not the case, in which case there is a m — >> such that G(a m ) — >> 0. As in the proof of 
Lemma 8.6, recognizing Riemann sums we see that, as m oc, 



J(-l,0)x(-l,l) d - 1 

We have u + v = M • 1 (where 1 = (1, . . . , 1)), so that e T M _1 (u + v) = 1, and therefore 
e T M _1 v = 1/2 by symmetry. Hence, G(a m ) — >► 1/2, which is a contradiction. ■ 

This lemma states a lower bound on the squared bias of linear filtering when / is an 
indicator function of a half hypercube. The result is actually much more general. Any 
function / in the cartoon class has a foreground Q whose boundary is well-approximated by a 
hyperplane — since d£l is Lipschitz — and / is approximately locally piecewise constant (/ is 
smooth on ft and fl c ). Hence, near the discontinuity, / resembles the function in Lemma 8.7. 

8.1.4. Some probability. The following result asserts that the maximum of m identically 
distributed random variables with exponentially decaying tails is at most a power of logm. 
Lemma 8.8. Suppose X\, . . . , X m are such that for some a,b,c > ; 




p(z) := (z s : \s\ <r), 



and 




Let 





P(|X r | >t) < cexp(-(Va) 6 ), Vt>c, Vr = l,... 



rn. 



Then for m sufficiently large, 




Proof Define x m — 



a(21ogm) 1 / 6 . By the union bound, 



P(max(|X!|,...,|X m |) >x m ) <P(|Xi| > x m ) + • • -F(\X m \ > x m ) 

< mcexp(—(x m /a) b ) 
cm' 1 0. ■ 



Lemma 8.9. For Xi ~ AA(0, of) for i = 1, . . . , m, and any C > ; we have 




l-C 
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Proof. Fix t > 1 and let a = maxi<^< m c^. By the union bound and the fact that 
P (AA(0, 1) > t) < exp(-t 2 /2), we have 

✓ \ m m 

F ( max \Xi\ > t) < VP(|Xi| > t) < V exp(-t 2 / (2a^)) < mexp(-t 2 /(2cr 2 )). 

\ Ki<m J ^ — ' ^ — ' 

V 7 1=1 1=1 

We then plug in t — a\/2Clog m. ■ 

Lemma 8.10. Suppose Xi ~ / or ^ = 1? • • • ? m - There is a constant C\ such that, for any 
C>Ci, ifk> (64/9)Clog(ra), we have 

P ^max Xi > k + 2 y/Ck log rn^j < m l - c/2 (8.13) 
P ( min Xi < k - 2^Ck\ogm ) < m 1_c/2 (8.14) 

\l<i<m J 

Proof Let us prove the first inequality. Since the moment generating function of a xi * s 
t -> (1 - 2t)- k 'H{t < 1/2), Chernoff's bound gives 

F(Xi >t)< exp(-(t- fc)/2 + (fc/2)log(t/fc)) , Vt > fc. 



We then use the inequality log(l + x) < x — x /2 + x /3, valid for x E (0, 1) and input 
t = k + 2^/Ck logm, to get 

-(t - k)/2 + (fc/2) log(t/fc) < -(t - A:) 2 /(4A:) + (t - A:) 3 /(6fc 2 ) 

= -C log m + (4/3)C 3/2 v/log(m)/A: logm, 

and bound the second term by (C/2) logm. We then obtain 

P(Xi > t) < m~ c/2 , 

and apply the union bound as before. The second inequality is proved in the same way 
considering that log(l — x) > —x — x 2 /2 — x 3 /3 holds for x E (0, 1). ■ 

Lemma 8.11. Suppose Xi ~ Xk(^i) (non-central chi-square) for i = 1, . . . , ra. There is a 
constant C\ such that, for any C > C\, if k > 16Clog(m) and 5 m i n := min^ 5i > 2^/ Clog m , 
we have 

P ( min Xi < 5 min /4 + k- 3^Ck logm ] < 2m 1 " c/2 . 

\l<i<m J 

Similarly, if 5 max = max^ 5i < y/Clogm, we have 

P f max Xi > k + 3\/Cfc logm J < m 1_c/2 . 

\l<i<m / 

Proof. First, Xi = (Z* + 5i) 2 + 1$, where Z^ ~ AA(0, 1) and Y$ ~ x|_i are i-i-d- Hence, 

min > min (Z^ + Si) 2 + min 1^. 

Ki<m Ki<m Ki<m 
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Let Ei — { max \Zi\ > ^/U\ogm}. By Lemma 8.9, we have 

l<i<m 

V(Ei)< m 1_c/2 . 

Let 

Fi = { min Y { < k - 1 - 2^/C(k - l)logm}. 

l<i<m 

To control the we apply inequality (8.14) to get 

F(Fi) < m 1 - 012 . 
Under Ef D i^ c , we have min^ Xi > S 2 /4 + k — 3y/Ck logm and 

P (Ef H Ff) = 1 - P (Ei U Fi) > 1 - P - P (i^i) > 1 - 2m 1 ~ c/2 . 
This proves the bound on min^X^; arguments for max^ Xi are similar and simpler. ■ 
8.2. Proofs of the main results. 

8.2.1. Proof of Theorem 4.1. We start with the upper bound. Fix / G J* car ^ oon (a, C ) 
with foreground fi. Let Q = {i : dist(x^, dfl) < h}. For i G Q, we use the fact that |jf^ — /^| < 1, 
which implies E(/i — /^) 2 < 1. For i ^ Q, from Lemma 8.5 and Lemma 8.6, coupled with the 
bias- variance decomposition (2.3), we get 

nh-h?<c{h 2a + cj 2 (nh)- d ). 

Using Lemma 8.3, we have \Q\ < 4 d n d \B(dn, h)\, while and \B(dQ : h)\ = 0{h) by Lemma 8.4 
and the fact that \dQ\ is of order 1. Summing over all i E I d , we get 

MSE /(/) ^ ^"J 91 ' C(h 2a + t7 2 (n/i)- d ) + M . c(l + a 2 {nh)- d ) <C 1 (h + a 2 {nh)- d ). 

Minimizing the RHS with respect to h yields the upper bound in Theorem 4.1. 

For the lower bound, redefine Q — {i : dist(x^,dfi) < h/Ci}, where C\ is the constant of 
Lemma 8.7. For i ^ Q, we use Lemma 8.5 and the bias- variance decomposition (2.3), to get 

m-u?>^\nh)- d . 

For i <E Q we use Lemma 8.7 and the bias- variance decomposition (2.3), to get 

m-fi) 2 >^r, 
^1 

Using Lemma 8.3 again, we have the following lower bound on the MSE (for n large enough), 

MS E/ (/) > . -La*(nh)- d + l §-^>C 3 (h + a\nhy d ). 

Minimizing the RHS with respect to h leads to the lower bound in Theorem 4.1. 
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8.2.2. Proof of Theorem 4.2. The proof for the upper bound is the same as that of 
Theorem 4.1 in smooth regions, leading to an upper bound on the MSE of the form 

MSE/(/) < C(h 2a + a 2 (nh)~ d ). 

Then minimizing this quantity over h gives the stated result. The lower bound is a well-known 
minimax bound [27, Theorem 5.1.2, p. 133]. 

8.2.3. Proof of Theorem 4.3. Let S(x) = dist(x,fi). The proof is similar to that of 
Theorem 4.2, except that the variance varies by location. The point bias is of order 0(h a ) 
everywhere, because the smoothing window is of radius at most h, with all points in the 
window being on the same side of the discontinuity. However, the point variance is of order 
0(a 2 \n5(xi)~\ ~ d ), since the window is of radius 8{xi) (immediate consequences of Lemma 8.5). 

Let us sum the point variances over all the pixels in the image. The situation is different 
according to the dimension. We start with d = 1, so that = (a, b) C (0,1). For 8 small 
enough, there are exactly four points at distance less than 8 from dfl (two on each side of 
the two jump locations). Let's consider the sample points Xi E [6, 1), and let j be such that 
Xj-i < b < xj. Note that j = bn(l + and we assume that b is fixed. For i E [j, j + nh], 

the variance is bounded by Ca 2 /(i — j + 1), while for i > j + nh (in the smooth region), the 
variance is of order 0(a 2 / (nh)) as before. Hence, summing over i > j, the averaged variance 
in that region is bounded by 

( 1+ £ ^_ + a^i) _ of-) (£ I + A . 0( « , 

n — nh — j \ z — ' i — j + 1 nh J n k J n 

The same is true for all the other three regions. 

When d > 2, the story is just slightly different. Define Qi = {i : 5{xi) < h2~^} and let 
£o be such that h2~^° < 2/n < h2~^ 0+1 . Stratifying, we have the following bound on the 
averaged variance 

V=0 ieQi\Q t+1 i<£Qo J £=0 K J 

By Lemma 8.3 and Lemma 8.4, we have \Qi \ Qt+i\ < \Qt\ < C\n d • /i2 - ^, for some constant 
C\. Hence, the first sum on the RHS of the last equation is bounded by 

C 2 a 2 h{nh)- d ^ j 2^ i < C 3 a 2 h(nh)- d • 2^^° = 0(a 2 /n). 

£=0 

This leads to an upper bound on the MSE of the form 

MSE/(/) < C(h 2a + a 2 A n /n + a 2 (nh)~ d ). (8.15) 
Minimizing this quantity over h gives the upper bound stated in Theorem 4.3. 
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For the lower bound, we know from minimax results underlying the lower bound in Theo- 
rem 4.2 that there are functions / in the cartoon class where the bias in the smooth regions is 
of order at least h a . As for the variance, our upper bound for the averaged variance is easily 
seen to be lower bounded (up to a multiplicative constant). This leads to a lower bound iden- 
tical to (8.15) modulo a multiplicative constant, and optimizing it leads to the lower bound 
in Theorem 4.3. We omit details. 

8.2.4. Proof Theorem 4.4. Fix i E I d and let rji = max^^f^^ \ej\. Then by the union 
bound and (4.1), 

Pfai >*) < \B(i,nh)\ max P(|^-| > t) < (2nh + l) d (1 - F(t/a)) =: p. (8.16) 

jEB(i,nh) 

Hence, with probability at least 1 — p, the event E{ := {qi < t} holds true. WLOG, assume 
that Xi E fl. Since fa is Co-Lipschitz, we have \f(xi) — f(xj)\ = \fa(xi) — fa(xj)\ < Coh when 
xj E fl n B(xi, /i), and by the triangle inequality, \yi — yj\ < Coh + \ei — £j\ < Coh + 2t under 
E{. Suppose there is xj E fl c H B{x^ h). In that case, there is x E dVt fl B(xi, h) and we have 

\f(Xi) - f(Xj)\ > \fa(x) - fac(x)\ - \fa(xi) - fa(x)\ - \fac(xj) - fac(x)\ 
> fi(f) - 2C h > 1/Co - 2C h, 

again by the triangle inequality and the fact that fac is also Co-Lipschitz, this implies that 

\ Vi - y 3 \ > l/Co - 2C h - \si - Sj\ > 1/Cq - 2C h - 2t, 

under Ei. We see that we need h y > Coh + 2t to ensure that sample points xj E fl fl B{x^ h) 
are selected, while we require that h y < 1/Cq — 2C$h — 2t so that points xj E fl c D B(x{, h) 
are disregarded. These two inequalities are, for example, satisfied when h y = l/(3Co) and 
t = l/(6Co), and h sufficiently small — by our assumptions, h = o(l). Assume h y and t are 
chosen that way. Then, when E{ holds, the photometric kernel in YF is able to exactly mimic 
the MO. 

We now turn to bounding the MSE. First, we have 

E(j? F - h? = E[(jf F - 01 {Et} ] + E[(jf F - h)H {En ]. 

Since f? F = f™° on E u 

E[(/? F - 0t m ] = E[(^° - f z )H {Et} ] < E(jf ° - 0. 

And since \f^ F — fi\ < 1 because of our clipping, we have 

E[(^ F - h)H {En ] < F(Ef ) < p. 

It remains to check that p is negligible compared to MO risk given in Theorem 4.2. Indeed, 
using the fact that txl, that h < 1 and that a < (C'logn) -1 / 6 , we have 

p = 0(nh) d exp[-(t/(Ca)) b ] = exp[(d - t b {C f /C b )) logn] = o(a 2 /n d ) 2a ^ d+2a \ 
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when C is sufficiently large, implicitly assuming that a is at least a polynomial in n, for 
otherwise the trivial estimator y = y is optimal. This concludes the proof. 

When the noise level is not small. Assume a is fixed, for simplicity. Note YF is identical 
to LF when h y — > oc sufficiently fast. Assume therefore that h y < ho for some fixed ho < oo. 
We now argue that YF is essentially useless when this is the case. Concretely, assume the 
reverse of (4.1), meaning 

F(\e z \ <t)< F(t/a), Vt, Vz e 1%. (8.17) 

We show that, when F(2ho/cr) < 1, YF has an overall squared bias (and therefore MSE) of 
order 1, which is comparable to the trivial estimator f = y. In other words, for large noise 
and relatively small h yi YF can perform worse than LF. For example, the bias is at least 
ho at locations i satisfying \ei\ > ho + h y . Indeed, we are averaging over values yj such that 
\yj-Vi\ < h y , so that \fi-yi\ < h y and therefore \fi~fi\ > \si\-\fi~yi\ > (h + h y )-h y = h . 
Moreover, by (8.17) 

P (M > h + h y ) > 1 - F(2h /a) > 0. 
Integrating the squared bias over these sample points alone leads to a lower bound of order 1. 

8.2.5. Proof of Theorem 4.5. For simplicity, we ignore boundary issues and in particular 
assume that all patches are of same size, with mp x (nhp) d sample points each, and similarly 
for spatial windows, with m^ x [nh) d sample points each. 

Upper bound for NLM-average. For i e I d such that fl dft ^ 0, we use the fact 
that \fi - fi\ < 1 to get 

E(^-/0 2 <1. 

Consider i with D dft = 0. WLOG, assume C ft. Take any j E B(i,nh). By definition, 

VPj - Vp t = Jpj - 7p 2 + e P . - £ Pl . (8.18) 

For the noise part we have 

e Pj - s Pi ~ A/"(0, a 2 A ZJ /m^), (8.19) 

where A^- is the number of sample points in the symmetric difference of P^ and Pj. By 
Lemma 8.9, we have that 

max \s P - ep \ <('= 2a^JC log(ra/J/ra P , (8.20) 

jeB( Xi ,h) 

-i /~i 

with probability at least 1 — m h . In the sequel, we fix C large and denote by Ei the event 
(8.20). For the signal part, we have the following 

7p 2 -7p 7 = -j- Y\ f(xk) - — Yl f( x *) 

x k^'i x k^'j 

mp 

x fc GPo 
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where Po is a generic patch centered at 0. If xj E ft with Pj C ft, then, since /q is Co-Lipschitz, 

\J P .-J P \ < — Y\ \fn(xk + Xi) - fn(x k + xj)\ 
1 3 mp L — ' 

< Colxi-xA < C h. (8.21) 



If xj E ft c , then there is a point x E B{x^ h)ndft, and we have f(x k ) = fn(x) + [fn(%k)—fn( x )) 
for x k E ft and f(x k ) = fac(x) + [/qc^) - fac(x)] for x fc E with \fn(x k ) - fn(x)\ < C h, 
\fn c (x k ) - fac(x)\ < C h and \fn(x) - fac(x)\ > 1/C - Hence, 



/Pi - f?i = — y] f( x k) - — y\ f(xk) 

% 3 mp ^-^ mp ^-^ 

x k ePi x k ePj 

= Ux) + — lMx k )-Mx)} 



mp 

x k ePi 

' J ' p x k ePjnn 

where \R\ < 2C$h. We now use Lemma 8.2 to bound the fraction above from below by 
(2C )- d , to get 

\f Pi -f P .\>(2CoT d Li-2C h. (8.22) 

Using the decomposition (8.18), coupled with the triangle inequality and (8.20), (8.21) 
and (8.22), we see that we need to choose h y such that 

C Q h + C < hy < (2C )"V - 2C /i - C- (8.23) 

The lower bound is to ensure that all the points xj E B(xi, h) such that Pj C ft are included 
in the neighborhood of X{ (i.e., uij — 1), while the upper bound is to ensure that no points 
in ft c are included (under Ei). For points xj E B(x{, h) such that Pj D ft c ^ 0, they may or 
may not be included, depending on how large that intersection is. Note that (8.23) is satisfied 
when h y is a sufficiently small constant since we have h — > 0, ( — > and n x 1 under our 
assumptions. In any case, we assume that (8.23) holds. 

In terms of MSE, we proceed as follows. Let Bi = {j : xj E B(xi 1 h)} 1 Bf = {j : xj E 
B(xi, h), Pj C ft} and A{ — {j : Uij = 1} — the latter is a random subset of B{. We saw that 
Ai D B® under E^ which implies 

Ei C {Ai D B?} c (J {Ai = A}, 

B°dAdBi 
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leading to 

E HiA^A}}. (8-24) 

B°cAcBi 

Using (8.24) and the fact that \fc — fi\ < 1, we have 

E(£ - /,) 2 = E[( £ - /,) 2 1 { ^ } ] + E[( £ - fi) 2 t {Ef} ] 
< = A ) E ^a - hf] + HE\ 



where /a is the local polynomial estimator based on A C I d . For the second term, F(E£) < 
m]^ by (8.20). For the first term, by Lemma 8.2, we know that B(x{, h) HQ contains a ball 
of radius C\h with C\ > depending only on Co and d. Hence, by the triangle inequality, 
B{xi 1 h) \ B(Q c ,hp) contains a ball of radius C\h — hp > C\h/2 (eventually), implying that 
Bf contains a discrete ball of radius at least (Cih/3)n x nh. Therefore x 1 and we 

may apply Lemma 8.5 and Lemma 8.6 to each A in the sum above, to get 

n?A - fi? < C 2 (h 2a + a 2 (nh)- d ), 

for a constant C 2 - Hence, using the fact that J2B°cAcBi = A) < 1, we have 

nfi - fif < C 2 (h 2a + a 2 {nh)- d ) + m\- c . 

By our choice for h, h 2a + a 2 {nh)~ d x (a 2 /n d ) 2a ^ d+2a \ and we may choose C large enough 
so the last term on the RHS is negligible, leading to an MSE at i of order 0(a 2 /n d ) 2a ^ d+2a \ 
The MSE is of the same order when X{ E £~i c , and summing over all i E I d , we get 

MSE/(/) < M + 0(a 2 /n d ) 2 ^+ 2 ^, 

where Q := {i : Pi D dfi 7^ 0}. Since Q C {i : dist(^,9f2) < Zip}, by Lemma 8.3 and 
Lemma 8.4, we have \Q\ < C2n d - hp, so that 

MSE/Cf) < 0(/i P + (a 2 /n d ) 2a ^ d+2a ^), 

Optimizing over hp subject to (8.23) being satisfied, we achieve the desired result. 

Upper bound for NLM. We follow the same arguments. Here we focus on i E I d such 
that dist(xi,dQ) > 2hp (instead of hp), and assume WLOG that x\ E Vt. Take j E B(i,nh) 
such that Pj n Pi = 0. Note that this is true when xj E Q c . By definition, 



yp, - yp z = fp, - fp, + e Pj - e Pi . 
Since e Pj - s Pt - AA(0, 2a 2 I mp ), we have ||y Pj — y P - 1|| ~ 2a 2 xl lp (\\fp J ~ f P* IliA 2 ^ 2 )), with 



l f P, — f P^ Hi = (f( x k + Zj) - f(x k + x^)) 2 . 
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If Xj E ft with Pj C ft, then 

ll f P, " f P, III = + Xi ) " + x i)) 2 ( 8 - 25 ) 

< mpCg 2 !!^ - x j\\l < mpClh 2 , (8.26) 

since fa is Cb-Lipschitz. By Lemma 8.11 and the fact that mpC^h 2 / a 2 = o(l), we conclude 
that 

max||y P - y P ||| < 2a 2 m P + £ x , Cx : = fo 2 V Cm P log ra/,, (8.27) 
j 

with probability at least 1 — m 1- ^/ 2 , where the maximum is over j such that xj E B{xi,h) 
and Pj C f2 \ P«. Let be this event. 

If Xj E ri c , then there is a point x E B{x\, h) fl <9fl Let Qj = {x/c E Po : x k + Xj E fl c }. 
For x/e E Qj, we use the decomposition 

f(x k + Xj) - f(x k + Xi) = fac(x k + x^) - fac(x) 

+fa c (x) - fa(x) + fa(x) - + Xi), 

with the first and third differences bounded by Coh in absolute value, and the second bounded 
from below by fi in absolute value. We therefore have 

4 := ||f p . _ f Pj ||2 > J2 (fsi(x k + Xi) - fac(x k + x,)) 2 (8.28) 

> \Qj\(n - 2C h) 2 > 5 2 := ra P (2C )~V/2, (8.29) 

where we used Lemma 8.2 to bound \Qj\ from below and the fact that jtixl while h = o(l). 
Since Hyp^. — ypj^ ~ 2cr 2 x^ p (5^ /(2cr 2 )) and Sij > 5, with Lemma 8.11 we see that 

min||y Pj -ypj| 2 >5 2 /4 + 2a 2 m P -C x , (8.30) 

3 

with probability at least 1 — m 1- ^/ 2 , where the minimum is over j such that xj E fi c n£>(x^, /i). 
Let Fi denote this event. 

Assuming (8.27) and (8.30) hold, we see that we need to choose h y such that 

2cr 2 m P + C x < h l < m P (2C )"V/8 + 2a 2 m P - C x - (8-31) 

The lower bound is to ensure that all the points xj E B(xi, h) such that Pj C ft and Pj-nP^ = 
are included in the neighborhood of X{ (meaning ujij — 1), while the upper bound is to ensure 
that no points xj E fl c are included (under Ei). For all other points xj E ft D B(x{, h), they 
may or may not be included, depending on how large that intersection is. Note that there is 
an h y satisfying (8.31) if, and only if, 

ra P (2C )~V/8 > 2C X ^m P > da 4 log n, 

for a constant C\ which depends only on d, Co, /x. Assuming mp is that large, (8.31) is satisfied 
when h 2 = 2(1 + rf)a 2 mp with 77 sufficiently small. In any case, we assume that (8.31) holds 
and the rest of the proof is identical to the one for NLM-average. 
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Lower bound (heuristics). We discuss here the lower bound and where the issues 
are. Consider the important case where a is fixed and assume that / = where Vt — 

(0,1/2) x (0, l)^ -1 . Consider direct neighbors (i.e., points with distance l/n d ) X{ E Vt and 
xj E J1 C . For Xk such that fl (P^ U Pj) = 0, we use (8.19) to arrive at 

y Pk -y P% ~J\f(\ k ,2a 2 /m P ), 

and 

y Pk -y P . ~Af(\ f kJ 2a 2 /m P ), 

where \\ k — X f k \ = |P^ \ Pj\ = m P ~ Y l d . When d < 2, the difference in means mp -1 ^ is of 
order at most that of the standard deviation mp -1 / 2 , so that these two distributions cannot 
be effectively separated. Heuristically, this indicates that if the photometric kernel of NLM- 
average includes x k in the neighborhood of X{, it also includes it in the neighborhood of xj 
with non-negligible probability. This is evidence that the squared bias is of order 1 at these 
points. Since there are order (nh) d ~ l such sample points, averaging over them yields a lower 
bound on the squared bias (and therefore on the MSE) of order (9(l/n). The same heuristics 
could be applied to NLM. 

The story changes for d > 3. In fact, for any / in the cartoon model with similar fore- 
ground, NLM-average — and NLM — achieve a much better risk. To see this, fix X{ E f2. We 
already know that NLM-average behaves well when P^ C ft; therefore assume that P^ flfi 7^ 0. 
If xj — Xi is not parallel to dfi, then \ f P . — f P .\ > mp -1 ^, so that under (8.19), 

\y Pk - y P . I > mp- x l d - ( > mp" 1 / 3 - £. 

Noting that ( x ^/log(n)/mp = o(m P ~ 1 ^ d ), if we choose h y x mp -2 / 5 , then with high 
probability, the neighborhood of X{ only includes those xj E B(xi,h) such that Xj IS 
parallel to dfl, perhaps excluding those such that Pj D P^ = 0. There are order (nh) d ~ l such 
Xj's, which drives the variance of the local polynomial estimator at X{. This applies to all X{ 
with P^ Hfi 7^ 0, and there are order n d h such x^s. The MSE over these points yields an MSE 
of order 

-l(rA) ( h 2( * + t^tt) = h 2 ^ 1 + (nh 2 )j4u- 
n dK J \ (nhy- 1 ) v J (nh) d 

We know that the MSE over the points away from the discontinuity is of order h 2a + , 
so the overall MSE is of order h 2a + (nh 2 V l)^5rp* Minimizing over h yields a lower bound 

Of ( a 2/ n d-l)2a/(2a+d-2) y ( (J 2/ n d)2a/(2a+d) ? which ig the M Q rate if d > 2a. 

Non-local versions. We quickly argue that, without spatial localization, YF, NLM, and 
NLM-average do not perform that well (relative to the MO), unless the underlying function is 
a polynomial (of degree at most r, where r is the chosen degree for the polynomial fitting) or all 
jumps are greater than h y . Let us look at what the methods do on noiseless data. For a given 
photometric bandwidth h y , consider the function / = hyt^y, where Q = (0, 1/2) x (0, l)^ -1 . 
Then both YF and NLM-average output a constant estimator equal everywhere to the local 
polynomial estimator applied to the whole image. Hence, the MSE is at least h 2 /A. Given 
that we take h y relatively large, this leads to a large MSE (of order 1). 
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8.2.6. Proof of Theorem 5.1. The only difference with the cartoon model is in the 
behavior of local polynomial regression. Fix a point X{ E By Lemma 8.4 (scaled by /i), 
Vol(B(xi,h) f] Q) x h d °a d ~ d °. (In fact, this is slightly easier here since ft is a band around 
the graph of a function.) Therefore, by Lemma 8.3 (scaled by h), we see that, #{j : xj E 
B(xi, h) n il} x n d h d °a d ~ d ° . This is the number of observations we are "averaging" over. 

For LF, we prove a lower bound of order 1 for the squared bias at X{\ we proceed as in 
Lemma 8.7 with only cosmetic adjustments. 

For BO, we apply LPR to the sample points xj belonging to the largest ball centered at 
X{ which is contained in Vt. Since we only consider x\ E Vt \ B{d£l,a/C), then this ball is of 
radius at least a/C. We then conclude using the same argument bounding the risk of BO in 
the cartoon model detailed in Section 8.2.3. 

For MO, and its mimickers YF and NLM, we need to refine Lemma 8.5 because, in the 
case where ft is a thin band, the largest ball within it is not representative of the sample size 
used in the local polynomial fit — which is what drives the variance. We explain how to adapt 
the proof of Lemma 8.5 to show that, for a constant C, 

Ca 2 

Var(/z) - n d h d °a d - d ° ' 
Let u\, ... , u m be a maximal a-packing of B{x^ h) fl ft, with m x (h/a) d °. Then 

m 

J B(u k ,a) C 

k=l 

Using the notation introduced in the proof of Lemma 8.5, we have 

m 

Z T Z y ^2 Zk^k, 

k=i 

where = {z s - : xj E B(uk, a), \s\ < r). We can then use (8.8) to obtain 

X m[n (ZlZ k ) > ^(na) d , 

implying 

A min (Z T Z) y ^m(na) d . 

This gives the upper bound on the variance, and the bias behaves as expected, meaning that 
Lemma 8.6 holds. It is now straightforward to deduce that MO at i has a squared bias of 
order 0(h 2a ) and a variance of O(a 2 n- d h- do a- d+d °). Given that h = h MO and a = o(/i MO ), 
the variance dominates and may be expressed as (h/a) d - do O(a 2 /(nh) d ), with 0(a 2 /(nh) d ) 
being the order of magnitude of the point risk of MO under the cartoon model. 

YF is still able to perfectly mimic MO under the conditions of Theorem 4.4 (same exact 
arguments) . 

For points x\ E ft with dist(a^, <9fi) > /ip LM , the analysis for NLM is again exactly the 
same, the difference here being in the number of j's such that Pj C f2, which is of order 
n d h d °a d ~ d ° . The rest is the same. 
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8.2.7. Proof of Proposition 5.1. Here ft and ft c are interchangeable, so we focus on the 
former WLOG and fix X{ E ft. Again, the only difference with the cartoon model is in the 
behavior of local linear regression and we need a stronger version of Lemma 8.5 when is a 
repeated pattern. Using notations introduced in the proof of Lemma 8.5, we have 

Z T Z = ^ Z^Z V , 

V 

where Z v = {z s - : xj E B(xi, h) H (S + v), \s\ < r). Note that we may restrict the sum to those 
v E aL d such that B(xi, h) D (H + v) ^ 0, and there are order (h/a) d such i?'s. Since they are 
all translates of each other, let us focus on S, that is, i? = 0. 

We again express Z^Zq as a sum of matrices by partitioning the d-dimensional subgrid 
{xj E 5} into discrete ID grids of the form 

■= {((ii - V2)/n, • • • , (id-i " l/2)/n, (id - l/2)/n) € S : j d = 1, . . . , [no]}, 
where ji, . . . , j^-i £ {!>••• ? [^a]}. We therefore have 

z o z o = 2^ z 5o Z 0'0' 

where Zy/) := : E D fi, |s| < r) for j' E {1, . . . , [na]} d_1 . 

Since Nq > (1/C)Nqc, we also have that Afe > (l/C^iV^^d^, so that S contains at least 
the fraction 1/(C + 1) of the sample points in (0, a) d and therefore 

E IM > (8-32) 

Let 

f := {/ E {1, . . . , [na]}^ 1 : > [na]/(2C + 2)}. 
Since \Lj/\ < [no], we have 

/6{l,...,[na]} d - 1 

so that \J'\ > [na] d - 1 /(2C + 1) by (8.32). We focus on Z (j / } with f E J'. Notice that this 
reduces the analysis to the one-dimensional case. 

Lemma 8.12. There is a numeric constant C > such that any polynomial regression 
matrix of the form U = ((k/m) s : < s < r; k E K), with K C {— m, . . . , m} and |X| > r + 1 7 
sate/^es A min (U T U) > |K|(|if|/m) 2 7C. 

Proof. Let fci < • • • < k q be the elements of K. Define £q — fe/( r +2)] and for £ = 1, . . . , £q — 

1, let Ki = {k£,k Mo , . . . ,^+(r+i)^ }- Note that 1^1 = r + 1 and ^+0"+iKo ~ ^+j^o > V 
Now, the matrix = {{k/m) s : < 5 < r; fc E iQ) is a Vandermonde (r + 1) x (r + 1) matrix. 
It is well-known that is invertible, and more precisely, the main result in [18] says that 
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|U, 1 ||oo= max IT 

j6{l,...,r+l 

where ||(a#)||oo •= maxj^j l a ul- Hence, 



i + l^+^ol/ 



liu^lb < ^TTIIU^ 1 !!^ < y/F+T(2m/e ) r , 
where || • ||2 is the usual Euclidean operator norm. Hence, 

A min (UjU £ ) > IIU^ 1 ^ 2 > (4/(2m)) 2 7(r + 1). 
Since the index sets Kn do not overlap, we have 

to 

A min (U T U) >^A min (Ujl^). 

i=i 

When r is fixed, £q x g, so the RHS x q{q/m) 2r . ■ 

Let Ci denote the constant of Lemma 8.12 and let C2 = C\{2C + l) 2r+1 . Applying this 
result under the assumption that [na]/(2C+l) > r+1, we find that A m i n (ZT.,xZ( J -/)) > [na]/C2 
for all jf' E J 7 . From here, we have 



(ZqZ ) > (#J')[na]/C2 > 



C 2 (2C + 2)' 
and then 

A min (Z T Z) > (Va) d A min (Z^Z ) x 

With this established, the bias behaves as in the cartoon model, and the rest of the analysis 
for MO and YF is exactly as before. 

For NLM, some additional arguments are required. We need to compare with other 
patches centered at xj E B(xi,h). First, suppose that Z . Then, by the periodicity 

of f], xj E Q too, and also x k + X{ E Vt if, and only if, x k + xj E ri, for all x k E Po- Hence, 

ll f P, ~ f Pi II 2 = ^2 (h(x k + xj) - fa(x k + Xi)) 2 

+ (fo c ( x k + Xj) - fac(x k + x^)) 2 

£ fc eP nn c 

< rap Co 1 1 #i — ^j|| 2 < rapCg/i 2 . 

This is the equivalent of (8.26). 

Suppose now that xj E fi c . Using the fact that fa and fac are Co-Lipschitz, we have 

fp 4 = fa(xi)i(Pi nQ) + fofaMPi n n c ) + o(h), 

and similarly, 

fp, = MxiMPj nn) + faixJUPj n n c ) + o(h), 
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since fafa) - fa(xj) = 0{h) and facfa) - fac(xj) = 0(h). Hence, 

l|fp,-fpJli>(/^(^)-^(^)) 2 

x iii^- n ft) - i(p< n + o{m P h 2 ) 

> fi 2 m P /C f + 0(m P h 2 ) 

by (5.1). This is the equivalent of (8.29). 

Arguing as the in the proof of Theorem 4.5, we see that, with high probability, the regres- 
sion neighborhood of X{ includes all xj such that xj — X{ E aL d , Xj E B{x^ h) and Pj D = 
— those Xj's are in ft like X{ — and excludes all xj E ft c such that Pj fl P^ = 0. There are 
of order (na) d such points. Using the same techniques as before, this leads to a bound on 
the variance of order (na) d /(nh) d . The trade-off with the bias for a choice of bandwidth h MO 
leads to the (na) d lZ MO upper bound in the proposition. 

In principle, an additional argument would be needed to exclude those xj E ft c such that 
Pj H Pi 7^ 0, since in that case ||ep. — sp. ||| is not chi-square as before. However, it is not 
hard to see that even if these are included in the regression neighborhood, it does not change 
things much since their number is small — of order O(logn). 
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